Classical reliability theory, as used in the social sciences, has
been  restricted  by  a  model  which  specifies  one  undifferentiated  error 
component.   This  restriction  has  limited  the  applicability  of  the  model 
and  has  obscured  its  interpretation.   Recent  advancements  in  psychometric 
theory  provide  more  flexible  models  which  permit  the  investigation  of 
multiple  sources  of  error  variation.   Under  the  rubric  of  generalizability 
theory,  these  methods  are  based  on  R.  A.  Fisher's  work  on  the  analysis  of 
variance  and  the  factorial  experiment. 

Generalizability  theory  is  potentially  very  useful  in  many  areas 
of  research  suffering  from  inconsistency  of  measurement.   In  particular, 
the  theory  is  applicable  to  the  assessment  of  writing  ability  from 
written  compositions.   However,  applied  studies  in  this  area  are  lacking. 

The  literature  on  the  measurement  of  writing  ability  has  identified 
several sources of error affecting the reliability of written compositions.
The most common sources of error noted are inconsistency across raters,
modes,  and  occasions.   In  spite  of  the  recognition  of  these  sources  of 
variation,  most  researchers  who  have  studied  the  reliability  of  written 
composition  have  examined  the  issue  only  in  terms  of  inter-rater 
reliability.   Implicit  in  the  concept  of  inter-rater  reliability  is  the 
assumption  that  fluctuations  among  raters  are  the  only  errors  in  the 
model.   This  study  incorporated  three  facets:   raters,  modes,  and 
occasions,  in  a  split-plot  factorial  design  in  order  to  examine  the 
results  obtained  by  taking  into  account  more  than  one  source  of  error 
through  the  methodology  of  generalizability  theory. 

Samples  of  writing  from  104  fourth  graders  were  obtained  under 
selected  mode  and  occasion  conditions.   Each  sample  was  scored  by  four 
trained  raters.   In  the  design,  the  students  were  considered  as  nested 
within  a  higher  classification,  the  classes.   The  number  of  students  in 
each  class  was  not  constant.   Therefore,  this  study  also  extended  the 
principles  of  generalizability  theory  to  unbalanced  designs. 

Point  estimates  of  the  variance  components  for  all  effects  in  the 
model  were  obtained  through  the  MIVQUE  method.   Negative  estimates  were 
replaced  by  zeros.   The  relative  magnitude  of  the  estimates  indicated 
that  students  could  be  differentiated  on  the  basis  of  their  ratings. 
However,  the  classes  as  units  could  not  be  distinguished.   The  estimates 
also  showed  that  errors  resulting  from  variability  in  the  quality  of 
writing  across  occasions  and  modes  outweigh  those  stemming  from  differences 
among  raters.   Furthermore,  occasions  represented  a  greater  source  of 
error  than  modes.   With  training  and  practice,  raters  can  consistently 
score  the  writing  samples  of  students  using  a  general  impression  method. 


Assuming homogeneity of variance, unbiased generalizability coefficients
were obtained for seven universes of generalization. These
universes  represented  generalization  across  one  facet,  two  facets,  or 
all  three  facets  simultaneously.   The  coefficients  indicated  that,  to 
obtain  acceptable  levels  of  generalizability,  at  least  six  samples  of 
writing  from  each  person  are  necessary. 

The  standard  error  of  measurement  which  may  be  used  in  constructing 
confidence  intervals  around  a  person's  universe  score  was  also  examined. 
The  results  from  this  examination  paralleled  those  based  on  the 
generalizability  coefficients. 

A supplementary analysis comparing the estimates obtained through the
MIVQUE method with those derived using expected mean squares yielded
similar values for all estimates in a model without the classes effect.
These results were interpreted as lending support to the MIVQUE method.

It  was  concluded  that  generalizability  theory  is  very  useful  for 
clarifying  problems  in  estimating  reliability  in  the  area  of  writing 
ability.   Furthermore,  the  theory  need  not  be  limited  to  situations  with 
balanced  data.   Valid  methods  of  variance  component  estimation  documented 
in  the  statistical  literature  may  be  used  with  unbalanced  designs. 


CHAPTER  I 
INTRODUCTION 

The  concept  of  reliability  in  educational  research  has  undergone 
notable refinements with a resulting increase in clarity and
applicability. However, these conceptual developments have not been matched
by  applications  in  many  content  areas.   For  example,  the  reliability 
of  essay  tests  still  represents  a  confusing  issue,  partly  because  of 
the  continued  use  of  the  classical  model  for  its  investigation.   This 
study  represents  an  attempt  to  "bridge  the  gap"  between  some  recognized 
methodological needs in the field of written language arts and
advancements in measurement theory.

Classical  reliability  theory  has  been  based  on  a  model  (originated 
by  Spearman  in  1904)  which  states  that  a  person's  observed  score  is  the 
sum  of  a  true  score  component  and  an  undifferentiated  error  component 
as  shown  below: 

(1)  $X = T + e$.

The  true  and  error  components  are  assumed  to  be  independent  of  each 
other.   Therefore,  the  variance  of  the  observed  scores  for  a  group  of 
individuals  can  be  partitioned  into  the  sum  of  independent  variance 
components  as  shown  in  equation  (2) . 

(2)  $\sigma_X^2 = \sigma_T^2 + \sigma_e^2$.

The reliability of a test is then defined as the ratio of the true score
variance to observed score variance:

(3)  $\rho_{XX'} = \sigma_T^2 / \sigma_X^2$.

Since $\sigma_T^2$ and $\sigma_e^2$ are unknown, in practice reliability is
estimated by computing the correlation between parallel forms of the test.
In order for tests to be parallel, they must have equal means, equal
variances, and equal intercorrelations among items. From these
restrictions and the assumptions imposed on the model (1), it can be
shown that if two tests are parallel, their correlation equals (3)
above (for proof see Magnusson, 1967).
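
The statement that the correlation between parallel forms equals the ratio
of true score variance to observed score variance can be checked
numerically. The following Python sketch is an illustration only and is not
part of the original study: the sample size, the score mean, and the two
standard deviations are arbitrary assumptions, and normally distributed true
scores and errors are assumed purely for convenience.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_parallel_forms(n_persons=20000, sd_true=2.0, sd_error=1.0, seed=0):
    """Simulate X = T + e on two parallel forms.

    Returns (variance ratio implied by the model, observed correlation).
    All numeric defaults are hypothetical illustration values.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, sd_true) for _ in range(n_persons)]
    # Each form shares the true score and adds its own independent error.
    form_a = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    form_b = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    theoretical = sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)
    return theoretical, pearson(form_a, form_b)


theoretical, empirical = simulate_parallel_forms()
print(f"variance ratio: {theoretical:.3f}")
print(f"parallel-forms correlation: {empirical:.3f}")
```

With a large simulated sample the two printed values should agree closely.
Giving the two forms different error variances or means would break the
parallelism assumption on which the equality rests.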

In addition to the restriction of parallelism, classical theory
considers the error component to be undifferentiated; that is, various
sources of inconsistency which may affect the reliability of the test
are grouped together in a single error term. Different procedures for
constructing parallel tests (e.g., test-retest, split-half) make different
assumptions about what constitutes the source of error in the model.
Therefore, following the classical model, more than one interpretation
of the same error component is possible.

The limitations of the classical model mentioned above render it
inefficient in many real life situations for several reasons. First,
the condition of parallelism is seldom met in the real world. It is
common to find that supposedly parallel tests have different means.
When tests have different means, the formulas which assume equality
provide an underestimate of the reliability (Ebel, 1951). Second, by
including only one error component which changes in meaning depending
on the method of obtaining parallel forms of the test, the classical
model can lead to some confusion. Unless the type of coefficient is
reported, the model provides no clues for the interpretation of the
error component. Finally, when more than one coefficient is desired
under the classical model, more than one study must be conducted. As a
result, the model does not allow for the consideration of error
resulting from interactions among sources.

In order to overcome these deficits inherent in the classical model,
some measurement specialists have adopted R. A. Fisher's conceptualization
of the factorial experiment (a method of classifying observations
along more than one dimension) and the analysis of variance (a procedure
which partitions total variability into identifiable sources). These two
powerful tools have allowed for the possibility of releasing the
restriction of parallelism and have provided a systematic approach to the
simultaneous consideration of multiple sources of error variation.

The  applicability  of  these  concepts  to  social  science  research  and 
specifically  to  the  reliability  problem  was  explicitly  discussed  by 
Lindquist  (1953).  Since  then,  these  techniques  have  been  widely  used  in 
testing  hypotheses  about  group  differences  but  only  rarely  in  assessing 
reliability. 

More  recently,  Cronbach  and  his  colleagues  have  assembled  all  of  the 
work which has been done along these lines under the rubric of
generalizability theory. The synthesis of their efforts is described in their
1972 book entitled The Dependability of Behavioral Measurements: Theory
of Generalizability for Scores and Profiles. Basically, generalizability
theory uses the analysis of variance approach in the estimation of
reliability. Rather than emphasizing the computation of reliability
coefficients as the classical theory does, it emphasizes the estimation of
variance components for all identifiable sources incorporated into the
design.   The  theory  allows  for  unequal  means,  decomposes  the  error  term 
into  separate  sources,  and  requires  the  explicit  consideration  of  the 
factors  identifying  the  population  of  measures  being  studied. 


If  desired,  the  variance  components  can  be  used  in  the  computation 
of  "generalizability  coefficients."  These  are  intraclass  correlations 
analogous  to  reliability  coefficients.   Within  a  less  restrictive  model 
which  partitions  the  variance  into  several  sources,  more  than  one 
coefficient  is  possible  from  just  one  study.   One  of  the  most  important 
advantages  of  this  approach  is  that  the  analysis  of  variance  technique 
can  be  applied  to  many  different  types  of  experimental  designs.   When 
the  levels  of  the  factors  included  in  the  design  result  from  a  factorial 
experiment,  the  analysis  of  variance  can  provide  estimates  of  the 
variability  due  to  interactions  among  factors.   As  was  previously  noted, 
these  interactions  were  undetectable  under  the  classical  model. 
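
As an illustration of such a coefficient, consider the simplest case: a
one-facet design in which every person is rated by every rater, which is
simpler than the design used in this study. Under the usual random effects
assumptions, the coefficient for generalizing over raters takes the
intraclass form sketched below. The symbols are assumptions introduced here
for illustration: $\sigma^2_p$ is the person (universe score) variance,
$\sigma^2_{pr,e}$ is the person-by-rater interaction confounded with residual
error, and $n_r$ is the number of raters whose scores are averaged.

```latex
\[
E\rho^2 \;=\; \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e} / n_r}
\]
```

With $n_r = 1$ the expression has the same form as the classical ratio of
true score variance to observed score variance, with rater-related
inconsistency playing the role of the error term.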

The Terminology of Generalizability Theory

Generalizability  theory  is  considered  by  its  developers  as  an 
extension  and  liberalization  of  classical  reliability  theory.   An 
important  distinction  is  made  between  two  kinds  of  studies:   G  and  D. 
A G-study or generalizability study is one where the sources and
magnitude  of  the  variability  in  one  particular  measurement  instrument 
are  investigated.   A  G-study  is  analogous  to  a  reliability  study  in  the 
classical  sense. 

A  D-study  or  decision  study  is  one  which  uses  information  concerning 
the  generalizability  of  a  specific  measurement  tool  for  decision  making 
purposes. Two types of decisions are identified: absolute and comparative.
Absolute  decisions  are  those  which  consider  each  individual  separately. 
Placement  and  classification  decisions  are  both  absolute  decisions.   A 
specific  example  would  be  a  decision  made  by  a  guidance  counselor  to 
place  a  student  in  one  of  several  curriculum  programs  on  the  basis  of 
the student's score on a test. A comparative decision is based on a
comparison of one individual to another or a comparison among groups
of individuals. Selection decisions, as well as decisions involving
group differences, fall under this category. An example of a comparative
decision occurs when the scores on a test are used as the dependent
variable for comparing the performance of two groups participating
in an experiment.

In  generalizability  theory  an  observation  is  considered  to  be  a 
sample  from  the  total  universe  of  observations  which  could  have  been 
made.  The  observation  is  described  in  terms  of  the  conditions  under 
which  it  is  made.  Two  or  more  conditions  of  the  same  type  constitute 
a facet. With one exception, in Fisherian terms a facet is a factor;
conditions  are  simply  the  levels  of  a  factor.  The  exception  is  that 
"persons"  is  never  considered  a  facet  in  a  G-study  even  though  it  is 
a  factor. 

When conducting a G-study, the investigator should include as many
as possible of the facets considered to affect the reliability of the
measure. From each facet, the investigator samples a set
of conditions under a particular design. The observations are then
made under the set of conditions sampled. The set of all possible
observations which could be included in the G-study is referred to as the
universe of admissible observations.

When  the  sampling  of  conditions  is  done  at  random  and  the  universe 
of  conditions  is  sufficiently  large,  the  investigator  is  operating  under 
a  random  effects  model.   This  model  is  the  one  most  commonly  used  in 
the  context  of  generalizability  theory  although  fixed  and  mixed  models 
are  also  possible. 


Regardless of the model used, the point estimates of the variance
components are obtained by computing the mean squares from the analysis
of variance and setting them equal to their corresponding expected mean
squares. Generalizability theory has customarily assumed that equal
numbers of observations appear in the subclassifications of the design.
This restriction simplifies the procedure for obtaining point estimates
of the variance components but is not strictly necessary. With equal
numbers of observations, the mean squares from the analysis of variance
are unique and are "the best" estimates possible. Having unequal numbers
of observations creates a situation where the investigator must decide
which of several sums of squares (and, therefore, mean squares) to use.
In either case, the expected values of the sums of squares are linear
combinations of the variance components. Therefore, solving the
resulting set of simultaneous equations will yield point estimates.
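
The procedure of equating observed and expected mean squares can be
sketched for the simplest balanced case, a persons-by-raters design with one
score per cell. This Python sketch is illustrative only: it is not the
MIVQUE procedure or the split-plot design used in this study, the function
name is hypothetical, and the ratings are invented. It assumes the standard
random effects expectations E(MS_p) = var_res + n_r var_p,
E(MS_r) = var_res + n_p var_r, and E(MS_res) = var_res.

```python
def variance_components(scores):
    """Estimate variance components for a balanced persons x raters design.

    scores[p][r] is the rating given to person p by rater r, with exactly
    one score in every cell. Negative estimates are replaced by zero.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Mean squares from the two-way analysis of variance without replication.
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[i][j] - p_means[i] - r_means[j] + grand) ** 2
        for i in range(n_p)
        for j in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Set each mean square equal to its expectation and solve.
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


# Hypothetical ratings: 4 persons (rows) scored by 3 raters (columns).
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [6, 6, 5],
    [3, 4, 4],
]
components = variance_components(ratings)
for source, value in components.items():
    print(f"{source}: {value:.4f}")
```

In an unbalanced design the mean squares are no longer unique, so a sketch
of this kind does not carry over directly.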

The  estimates  of  the  variance  components  obtained  under  a  G-study 
can  be  used  in  subsequent  D-studies  as  long  as  the  facets  included  in 
the  D-study  were  also  included  in  the  G-study.   The  conditions,  however, 
do  not  have  to  be  the  same  if  a  random  effects  model  is  being  considered. 

The  set  of  all  possible  observations  to  which  an  investigator 
carrying  out  a  D-study  wishes  to  generalize  is  termed  the  universe  of 
generalization.   This  universe  must  then  be  a  subset  of  the  universe  of 
admissible  observations  of  the  G-study  providing  the  variance  component 
estimates.   This  relationship  between  a  G-study  and  subsequent  D-studies 
implies  that  the  utility  of  a  G-study  depends  upon  its  ability  to  provide 
estimates of as many components of variance as might arise in future
D-studies. That is, a G-study providing estimates of three components of
variance is more useful than one where only one of those three components
is estimable, all other things being equal.
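
The way G-study estimates feed a later D-study can be sketched as follows.
This is a minimal Python illustration under stated assumptions, not the
computation carried out in this study: it assumes a fully crossed persons x
raters x occasions random effects design, the function name and dictionary
keys are hypothetical, and the variance component values are invented.

```python
def g_coefficient(vc, n_raters, n_occasions):
    """Generalizability coefficient for comparative decisions.

    vc holds G-study variance component estimates with hypothetical keys:
      "p"     persons (universe score variance)
      "pr"    person x rater interaction
      "po"    person x occasion interaction
      "pro_e" three-way interaction confounded with residual error
    """
    # Each error component shrinks with the number of conditions sampled.
    relative_error = (
        vc["pr"] / n_raters
        + vc["po"] / n_occasions
        + vc["pro_e"] / (n_raters * n_occasions)
    )
    return vc["p"] / (vc["p"] + relative_error)


# Invented G-study estimates, used only to show the mechanics.
estimates = {"p": 0.50, "pr": 0.05, "po": 0.30, "pro_e": 0.40}

for n_occasions in (1, 2, 4, 6):
    coefficient = g_coefficient(estimates, n_raters=2, n_occasions=n_occasions)
    print(f"occasions={n_occasions}: coefficient={coefficient:.3f}")
```

A D-study that samples a facet the G-study never varied has no
corresponding entry in the dictionary, which is the practical meaning of the
subset requirement described above.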

Purpose of the Study

The purpose of this study was to apply the principles of
generalizability theory to the assessment of written composition. The
significance of this study is twofold. First, it illustrates how theoretical
measurement concepts can be applied and extended to fit specific problems
encountered  in  assessment.   Second,  it  provides  guidelines  to  applied 
researchers  and  evaluators  in  the  field  of  writing  for  improved  methods 
of  estimating  the  reliability  of  their  assessment  procedures. 

With the current movement toward teaching basic skills, the
effectiveness of tests in assessing progress in reading, writing, and
arithmetic is under scrutiny. Of these three areas, writing presents
a paradoxical conflict. While objective tests of writing ability are
generally more reliable, essay tests or written compositions are
considered to be more valid measures of writing ability (Coffman, 1971).
The opinion of most specialists in the field of language arts is that
the validity of essay tests should not be traded for the higher
reliability of objective tests (McColly, 1970).

Given  this  preference  for  the  essay  test  in  the  assessment  of 
writing  skill,  any  efforts  to  improve  the  quality  of  measurement  in 
this  area  should  focus  on  this  test  form.   Unfortunately,  advancements 
in  measurement  theory  and  practice  have  been,  for  the  most  part, 
restricted  to  objective  tests.   However,  generalizability  theory  offers 
great  potential  usefulness  for  upgrading  the  reliability  of  measures  of 
written  composition.   Applied  studies  are  needed  to  test  the  reality  of 
that  potential. 


Statement of the Problem

The need to study the application of generalizability theory to
the assessment of writing skills becomes apparent when the
recommendations on research methodology from leading curriculum specialists
in written language arts are examined. Although the specific
recommendations will be discussed in the following chapter, at this point we will
note  that  more  than  one  source  of  variability  affecting  the  reliability 
of  written  composition  has  been  identified.   The  most  common  sources  of 
error  noted  are  inconsistency  across  raters,  modes,  and  occasions. 

In  spite  of  the  recognition  of  these  sources  of  variation,  most 
researchers  who  have  studied  the  reliability  of  written  composition 
have  examined  the  issue  only  in  terms  of  inter-rater  reliability. 
Implicit in the concept of inter-rater reliability is the assumption that
fluctuations among raters are the only source of error in the model.
This study incorporated three facets in a split-plot factorial design
in order to examine the results obtained by taking into account more
than one source of error. Following the recommendations of Brennan (1975),
the students were seen as nested in a higher classification, the classes.
The  number  of  students  in  each  class  was  not  constant.   Therefore,  this 
study  also  explored  procedures  for  obtaining  estimates  of  the  variance 
components  which  are  applicable  to  unbalanced  designs.   An  unbalanced 
design  as  defined  here  is  one  with  unequal  numbers  of  observations  in 
the  subclassifications  (Searle,  1971a). 

Using  the  results  from  this  study,  we  will  be  able  to  assess  the 
magnitude  of  each  source  of  variability  and  determine  which  ones  are 
the most important to control in order to obtain reliable assessments
of writing skill. Based on these results, we will be able to make
recommendations  for  the  design  of  future  D-studies  using  the  same  method 
of  assessment.   These  recommendations  will  include  the  nature  of  the 
facets  which  must  be  considered  as  well  as  the  frequency  with  which 
each  facet  should  be  sampled.   Both  absolute  and  comparative  decisions 
will  be  taken  into  account. 
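
In the generalizability literature, comparative decisions are usually tied
to a relative error variance and absolute decisions to an absolute error
variance that also includes the main effects of the facets. The Python
sketch below illustrates the distinction for a one-facet persons-by-raters
design only. It is not the analysis reported in this study; the function
names and the numeric values are hypothetical.

```python
import math


def error_variances(var_rater, var_residual, n_raters):
    """Return (relative, absolute) error variance for a mean over n_raters.

    var_rater    rater main effect (differences in rater severity)
    var_residual person x rater interaction confounded with residual error
    """
    # Comparative decisions: rater severity shifts every person equally,
    # so only the interaction/residual term disturbs the rank order.
    relative = var_residual / n_raters
    # Absolute decisions: the level of the score matters, so rater
    # severity contributes to the error as well.
    absolute = (var_rater + var_residual) / n_raters
    return relative, absolute


def standard_errors(var_rater, var_residual, n_raters):
    """Square roots of the two error variances (standard errors)."""
    relative, absolute = error_variances(var_rater, var_residual, n_raters)
    return math.sqrt(relative), math.sqrt(absolute)


# Invented variance components, for illustration only.
relative, absolute = error_variances(var_rater=0.2, var_residual=0.4, n_raters=2)
print(f"relative error variance: {relative:.3f}")
print(f"absolute error variance: {absolute:.3f}")
```

The absolute error variance can never be smaller than the relative one,
which is why a procedure adequate for ranking students may still be too
imprecise for placing them against a fixed standard.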

In  addition,  the  estimates  of  the  variance  components  will  be  used 
in  the  computation  of  several  generalizability  coefficients.   The 
coefficients  to  be  considered  are  those  which  provide  estimates  of  the 
reliability  when  generalization  is  intended  in  either  one  dimension 
(across  raters,  modes,  or  occasions),  two  dimensions  (raters  and  modes, 
etc.),  or  three  dimensions  (raters,  modes,  and  occasions). 
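
The count of coefficients follows from the three facets: every non-empty
combination of raters, modes, and occasions defines one universe of
generalization. A short Python sketch of the enumeration (the variable names
are illustrative only):

```python
from itertools import combinations

facets = ("raters", "modes", "occasions")

# Every non-empty subset of the facets is one universe of generalization.
universes = [
    combo
    for size in range(1, len(facets) + 1)
    for combo in combinations(facets, size)
]

for universe in universes:
    print(" x ".join(universe))
print(f"number of universes: {len(universes)}")
```

Three single-facet universes, three two-facet universes, and one three-facet
universe give the seven coefficients considered in the study.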

Significance of the Study

A  renewed  nationwide  interest  in  the  assessment  of  writing  may  be 
evidenced  by  the  following  events: 

1.  A  compositional  writing  subtest  is  being  reinstated  on  the 
Scholastic  Aptitude  Test  (SAT)  examination  used  by  many  colleges  and 
universities  for  student  selection  and  placement. 

2.  Interest in expanding the base of knowledge on the writing
process has been underscored by the National Institute of Education (NIE)
in the 1978 competition for Basic Skills Awards.

3.  A  number  of  states  now  include  writing  as  a  skill  to  be 
tested  in  their  efforts  to  establish  statewide  standards  for  minimum 
educational  competency. 

Those responsible for the preparation of these examinations will
naturally be governed by practical considerations, such as the
demonstrable quality of those examinations. Demonstrating the reliability
of their techniques must be one of the considerations. Generalizability
coefficients,  which  take  into  account  various  sources  of  error  associated 
with  writing  assessment,  provide  unambiguous  estimates  of  reliability. 
As  a  result,  they  are  preferable  to  the  traditional  inter-rater 
correlation  coefficient. 


CHAPTER  II 
REVIEW  OF  THE  LITERATURE 

The  literature  reviewed  in  this  chapter  has  been  selected  from 
three distinct fields: language arts, measurement theory, and
statistical methodology. The review is organized in the following manner.
First,  selected  literature  pertinent  to  the  assessment  of  writing 
ability  is  presented  to  establish  the  rationale  for  the  content  area 
of  this  study.   Particular  attention  will  be  given  to  studies  involving 
primary  grade  children.   Next,  the  development  of  generalizability 
theory  is  traced,  followed  by  references  illustrating  applications  of 
the  theory.   Finally,  selected  references  from  the  literature  on  methods 
of  variance  component  estimation  are  reviewed  with  emphasis  on  methods 
that  are  applicable  to  unbalanced  designs.   These  designs  have  greatest 
utility  in  determining  generalizability  coefficients  for  assessments 
of  written  composition. 

The  Assessment  of  Writing  Ability 

Teachers  and  nonteachers  alike  would  agree  that  writing  is  one  of 
the  most  important  subjects  taught  in  schools.   But  the  importance  of 
the  subject  has  not  been  accompanied  by  effective  assessment.   Evaluating 
students'  writing  performance  continues  to  be  a  problem  for  writing 
specialists,  English  teachers,  and  researchers  investigating  this 
complex  area.   Both  objective  tests  and  compositional  writing  (essay 
tests)  continue  to  be  used  (Coffman,  1971).   However,  the  balance  seems 
to  be  on  the  side  of  essay  tests.   After  reviewing  several  standardized 
objective tests of writing, McCaig (1977) recommended "to evaluate
achievement in writing, evaluate the writing of children" (p. 491).

Several  other  experts  in  the  field  also  agree  that  writing  ability 
is best determined by looking at actual writing performance (Coffman,
1971; McColly, 1970). The members of a recent Louisiana State
Department of Education conference on minimum writing proficiency unanimously
recommended  that  any  test  of  writing  proficiency  include  a  sample  of 
the  student's  writing  (Suhor,  1977). 

Objective tests are generally not recommended. A quote from
Braddock (1976) emphasizes the point:

At  this  stage  of  our  understanding  of  writing 
and  of  testing,  it  is  difficult  to  believe 
that  any  standardized  test  will  be  constructed 
which  can  measure  such  ability.  Therefore,  anyone 
who  professes  to  evaluate  "writing  ability"  with 
a standardized test is either telling a falsehood
or speaking from ignorance. (p. 119)

At the present time, essay tests are included in a number of
commonly used tests of English. Examples of these are the Language Skills
Examination,  the  College  Entrance  Examination  Board,  and  the  writing 
test  developed  by  NAEP.   These  may  be  used  for  the  prediction  of 
success  in  English,  placement  in  special  courses,  exemption  from  required 
courses,  program  evaluation,  and  experimental  or  correlational  research 
(Cooper  and  Odell,  1977). 

Sources  of  Error  in  Essay  Tests 

The  problem  of  the  reliability  of  essay  tests  has  been  widely 
recognized  for  some  time  (Meckel,  1963).   Adequate  reliability  is 
particularly  important  in  required  writing  courses  in  which  students 
must  earn  a  satisfactory  grade  and  also  in  research,  when  essay  tests 
are used as a measure of gains or losses in skill which are to be
attributed to experiments in teaching methods.

Diederich  (1957)  suggested  that  the  major  problem  of  grading  essays 
has  to  do  with  variation  in  the  grades  assigned  by  different  readers. 
Commenting  on  the  difficulties  involved  in  grading  such  tests,  he 
pointed  out  that  when  10  readers  read  a  set  of  papers  without  discussing 
standards,  it  is  likely  that  average  papers  will  receive  the  whole 
range  of  grades.   He  suggested  three  criteria  for  judging  essay  tests  of 
writing  ability.   First,  the  writing  assignment  should  be  like  the 
writing  students  do  in  the  normal  course  of  events.   Second,  the  grading 
should  be  independent  of  the  writer's  knowledge  of  the  subject  matter. 
Finally,  the  topic  must  be  within  the  student's  comprehension.   These 
criteria  were  met  in  the  selection  of  assignment  and  in  the  grading  of 
the  samples  used  in  this  study. 

To  improve  the  reliability  of  essays,  he  recommended  that  all 
students  write  on  the  same  topic,  that  readers  be  trained,  and  that  at 
least  two  samples  of  writing  be  obtained  from  each  student.   This  last 
recommendation  suggests  a  second  source  of  variation  related  to  the 
reliability  problem.   Meckel  was  aware  of  this  source  when  he  said: 
"samples  of  writing  done  over  a  semester  are  obviously  a  better  index 
of  writing  ability  than  a  single  essay"  (p. 988). 

Braddock, Lloyd-Jones, and Schoer (1963), after screening and
reviewing  484  studies  on  writing,  discussed  four  sources  of  variation 
which  should  be  taken  into  account  when  rating  compositions.   These 
sources  are:   the  writer  variable,  the  assignment  variable,  the  rater 
variable,  and  the  colleague  variable.   The  writer  variable  refers  to 
day-to-day fluctuations in the writing performance of individuals,
particularly the performance of better writers. On this issue these
authors  recommend  that  each  student  write  at  least  twice. 

Under  the  assignment  variable,  Braddock  et  al.  included  four 
aspects:   topic,  mode,  time,  and  situation.   They  hypothesized  that 
variation  in  mode  may  have  a  stronger  effect  on  the  quality  of  writing 
than  variation  in  topic.   The  modes  considered  by  these  authors  were: 
narration,  description,  exposition,  argument,  and  criticism.   With 
respect  to  time  and  condition,  their  recommendation  was  to  allow  as  much 
as  20  to  30  minutes  of  writing  time  for  primary  grade  children  and  to 
standardize  the  conditions  across  all  children. 

The  rater  variable,  as  defined  by  Braddock  et  al.,  refers  to  the 
tendency  of  a  rater  to  vary  in  his/her  own  standards  of  evaluation  while 
the  colleague  variable  refers  to  variation  in  standards  across  different 
raters.  The  existence  of  inter-rater  variability  has  been  substantiated 
very  frequently  by  research.  Braddock  et  al.  recommended  that  the  raters 
have  a  common  set  of  criteria  and  that  they  practice  together  in  applying 
those  criteria  consistently.  Two  additional  recommendations  were  offered 
in order to reduce the inter-rater variation. One of them was to preserve
the anonymity of the writer. (These recommendations were previously made
by Diederich.) The second one was to control for rater fatigue. As will
be  shown  in  the  next  chapter,  these  recommendations  were  followed  in  the 
rating  of  the  samples  used  in  this  study. 

McColly  (1970)  categorized  the  sources  of  error  in  grading  essay 
tests  of  writing  ability  into  three  general  sources:  students,  readers, 
and  topics.  In  determining  his  classification  scheme,  he  considered  the 
categories offered by Braddock et al. as well as those proposed by
French (1962). French's categories, almost identical to McColly's,
consist of student errors, test errors (the task and the topic), and
scale errors (reader disagreement).

Under  the  student  source,  McColly  considered  conditions  such  as 
distractions  (both  internal  and  external)  as  well  as  the  motivation  of 
the  student.   He  recommended  allowing  the  student  at  least  40  to  45 
minutes  of  writing  time. 

With  respect  to  readers,  McColly  concurred  that  readers  must  be 
given  the  proper  training  and  orientation  as  well  as  the  opportunity  to 
practice.   Practice  is  indispensable  in  establishing  the  proper  speed 
and  rate.   He  makes  the  following  general  statement  in  this  regard: 
"up  to  the  point  where  the  prose  becomes  ununderstandable,  the  faster 
the rate and speed, the more valid and reliable the judgement" (p. 150).

As far as the topic is concerned, McColly discussed the
relationship between assessing writing ability and structuring the assignment.
In his view, by providing students with the content in a writing test,
one is filtering out, to some extent, the factor of subject matter
mastery. On the other hand, when all of the content is provided,
writing becomes simply an exercise in logic. He concluded that more
experimentation is needed in this area in order to determine to what
extent content should be provided so that the test assesses writing
ability rather than knowledge of subject matter or logic.

It  is  important  to  make  a  distinction  between  the  use  of  the  essay 
to  assess  ability  to  communicate  within  a  subject  area  and  the  use  of 
written  compositions  to  assess  ability  to  write.   Coffman  (1971)  has 
addressed  the  former,  but  some  of  his  ideas  are  relevant  to  the  latter 
use. In particular, Coffman's chapter deals with the essay examination
when it is used by individual teachers in measuring the outcome of
instruction.

In his chapter, Coffman considered three sources of error affecting essay
scores: inter-rater variability, intra-rater variability, and freedom
of responses. Not all three sources are pertinent to all uses of the
essay. The last source is related to McColly's concern with the structure
of  the  assignment.   According  to  Coffman,  if  ratings  are  used  only  to 
determine  the  rank  order  of  the  pupils,  only  the  first  source  of  error 
is  of  concern.   However,  if  the  ratings  are  treated  as  direct  measures 
of  quality,  then  all  sources  of  error  become  critical. 

More recently, Cooper and Odell (1977) have noted that to obtain
reliable measures of writing ability through essay tests, it is necessary
to have more than one piece of writing from more than one occasion and
to involve two or more persons in rating each piece. Thus, these authors
implied that raters, occasions, and assignments are sources of error.

A  line  of  empirical  studies  addressing  the  issue  of  factors  affecting 
specifically  the  writing  of  children  clearly  points  out  that  writing 
mode  is  an  important  source  of  variation.   Seegars,  as  early  as  1933, 
cautioned  teachers  and  researchers  to  be  alert  to  the  different  impacts 
of  the  modes  in  evaluating  and  analyzing  children's  writing.   Several 
experimental  studies  conducted  in  the  60's  generally  support  Seegars' 
contention  in  samples  of  first  and  third  grade  children  (Johnson,  1967; 
Anderson  and  Bashaw,  1968).   More  recent  studies  offer  added  evidence 
that  the  mode  is  related  to  the  quality  of  children's  writing  (Bortz, 
1970;  Veal  and  Tillman,  1971;  Pope,  1974;  Perron,  1976). 

In  most  of  these  studies,  a  measure  of  syntactic  complexity  such  as 
number  of  clauses  or  number  of  words  per  clause  was  used  as  the  dependent 
variable. The modes investigated were descriptive, argumentative,
narrative, and expository.

In spite of the recognition that occasion variability, assignment
variability,  and  mode  variability  are  sources  of  error  in  assessing 
writing  ability,  most  researchers  who  study  compositional  writing  have 
considered  the  issue  of  instrument  reliability  in  terms  of  inter-rater 
reliability. For example, Cohen (1973), in evaluating the writing ability
of college students, determined reliability using percentage of
agreement among raters. When Fagan, Cooper, and Jensen (1975) reviewed
several  available  measures  for  evaluation  and  research  in  written 
language  arts,  inter-rater  reliability  or  percentage  of  agreement  between 
raters  constituted  the  most  common  type  of  reliability  estimates 
reported.   The  only  other  type  of  estimate,  reported  in  only  two  cases, 
was test-retest reliability. More recent investigations of the
reliability of specific instruments equate reliability with agreement
across raters. An example is Singleton's (1977) dissertation on the
reliability of ratings assigned on the essay portion of the Language
Skills Examination.

It seems that essay test reliability has practically become synonymous
with inter-rater reliability. A likely explanation for this phenomenon
is that non-statistical psychologists find it easier to think in
terms  of  correlations.   A  Pearson  product-moment  correlation  coefficient 
may  be  easily  computed  between  the  scores  assigned  by  two  raters.   But 
this  correlation  coefficient  does  not  adequately  assess  all  of  the  sources 
of  variation  (Coffman,  1971). 
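
The limitation can be seen in a minimal sketch. The scores below are hypothetical, not data from any study cited here, and the helper name `pearson` is an assumption made for illustration. One rater is uniformly two points more lenient than the other, yet the correlation between them is perfect, so the systematic difference between raters goes undetected.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five essays; rater B is uniformly 2 points more lenient.
rater_a = [1.0, 2.0, 3.0, 4.0, 5.0]
rater_b = [score + 2.0 for score in rater_a]

r = pearson(rater_a, rater_b)
mean_gap = sum(b - a for a, b in zip(rater_a, rater_b)) / len(rater_a)
print("correlation:", r, "mean difference between raters:", mean_gap)
```

An analysis of variance on the same scores would isolate the rater main effect as a separate component, which is the motivation for the approach discussed next.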

Coffman  suggested  using  the  analysis  of  variance  approach  to 
adequately  assess  more  than  one  source  of  error  variation.   Stanley 
(1962)  had  previously  discussed  a  specific  design  which  could  be  used 
to  assess  the  reliability  of  raters  and  test  forms. 


A  classic  study  by  Finlayson  (1951)  is  the  first  reliability  study 
to  consider  rater  and  test  variability  as  sources  of  error  in  essays. 
Based  on  a  sample  of  197  children  who  wrote  two  essays,  he  reported 
mean  coefficients  of  .697  and  .810  for  the  reliability  across  tests  and 
raters,  respectively.   Each  essay  was  rated  by  six  raters,  using  a 
general  impression  method  of  scoring  with  a  1  to  5  scale.   In  a  second 
part  to  his  study,  Finlayson  used  the  analysis  of  variance  in  a  197x2x6 
random effects design. In testing the significance of effects he found
the child-by-essay interaction significant, suggesting that the
performance of a child in one essay is not representative of his/her
ability to write in general. The child-by-rater interaction was not
significant.
From  his  results,  it  may  be  concluded  that  test  variation  represents  a 
greater  source  of  error  than  rater  variation. 

In  a  follow-up  study,  Vernon  and  Millican  (1954)  investigated  the 
reliability  across  7  raters  and  7  topics  for  a  sample  of  224  college 
students  using  a  general  impression  5-point  scale.   They  reported  mean 
correlations between raters on the same topic and between topics. These
were .509 and .366, respectively. In the authors' words: "a still more
serious source of inconsistency in assessing English ability is the
varying performance of candidates when writing essays on different
topics" (p. 73).

In  view  of  the  recommendations  made  by  language  arts  specialists 
and  the  results  of  the  empirical  studies  reviewed,  it  appears  that 
extending  the  design  of  Finlayson  to  include  raters,  modes/topic,  and 
day-to-day  variation  as  possible  sources  of  error  is  in  order.   To 
best assess all sources simultaneously, the principles of generalizability
theory will be applied. The development of generalizability theory will
be discussed in the following section.

Generalizability Theory

The  conceptual  underpinnings  of  generalizability  theory  are  based 
on  Fisher's  (1925)  work  on  the  analysis  of  variance,  the  factorial 
experiment,  and  the  intraclass  correlation. 

The idea of using the analysis of variance to estimate the
reliability of a test is due to Cyril Burt who translated the work of Fisher
for his students with the aid of P. O. Johnson, J. Neyman, and R. W. B.
Jackson (Burt, 1955). Burt considered measurements as varying in three
dimensions: with respect to the person, the test form, and the occasion.
The  reliability  of  the  test  is  estimable  from  a  comparison  of  individual 
variance  to  group  variance.   In  Burt's  words: 


On  comparing  the  two  variances  it  would  then 
seem  possible,  on  intuitive  grounds,  to  infer 
that,  when  the  variance  of  the  measurements  for 
a  single  individual  becomes  as  large  as  the 
variance  for  the  entire  sample  of  different 
individuals, the test used will be of no practical
value whatsoever: for the whole object of
such  a  test  is  to  distinguish  the  ability  as 
measured  for  any  given  individual  from  the 
abilities  of  the  rest.  (p. 105) 

Burt  showed  how  the  intraclass  correlation  provided  an  estimate  of  the 
reliability. 

The  intraclass  correlation  was  introduced  by  Fisher  in  the  context 
of  the  random  effects  model.   Scheffe  (1959)  illustrates  it  using  the 
model 

(4)    $y_{ij} = \mu + a_i + e_{ij}$

where $\mu$ is the grand mean and $a_i$ and $e_{ij}$ are independent with
zero means and variance matrices $\sigma^2(a)I$ and $\sigma^2(e)I$,
respectively. The variance of $y$ may be expressed as

(5)    $\sigma^2(y) = \sigma^2(a) + \sigma^2(e)$.

The observations within any class are not statistically independent.
The statistical dependence between any two observations $y_{ij}$ and
$y_{ij'}$ in the same class is expressed as

(6)    $r_{\text{intraclass}} = E[(y_{ij} - \mu)(y_{ij'} - \mu)] / \sigma^2(y)$
       $= E[(a_i + e_{ij})(a_i + e_{ij'})] / \sigma^2(y)$
       $= E(a_i^2) / \sigma^2(y)$
       $= \sigma^2(a) / [\sigma^2(a) + \sigma^2(e)]$.
Thus,  the  intraclass  correlation  may  be  estimated  by  obtaining  point 
estimates  of  the  variance  components. 
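
As an illustration of this last step, the sketch below estimates the two variance components of the one-way random effects model from the between-class and within-class mean squares and then forms the intraclass correlation. It assumes balanced data (the same number of observations in every class). The function name `one_way_components` and the data are hypothetical.

```python
def one_way_components(classes):
    """Balanced one-way random effects ANOVA estimates.

    classes: list of equal-length lists, one list of observations per class.
    Returns (var_a, var_e, icc), the estimated between-class component,
    within-class component, and intraclass correlation.
    """
    k = len(classes)
    n = len(classes[0])
    if any(len(c) != n for c in classes):
        raise ValueError("this sketch assumes balanced data")
    class_means = [sum(c) / n for c in classes]
    grand_mean = sum(class_means) / k
    ms_between = n * sum((m - grand_mean) ** 2 for m in class_means) / (k - 1)
    ss_within = sum((y - m) ** 2 for c, m in zip(classes, class_means) for y in c)
    ms_within = ss_within / (k * (n - 1))
    var_e = ms_within                        # E(MS within) = var_e
    var_a = (ms_between - ms_within) / n     # E(MS between) = var_e + n * var_a
    icc = var_a / (var_a + var_e)
    return var_a, var_e, icc

# Hypothetical data: three classes with two observations each.
var_a, var_e, icc = one_way_components([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]])
print(var_a, var_e, icc)
```

Note that the estimate of the between-class component can be negative when the between-class mean square is smaller than the within-class mean square; the sketch does not truncate it.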

Pilliner  (1952)  compared  the  estimate  of  reliability  obtained 
from  the  intraclass  correlation  to  that  obtained  from  the  Pearson 
product-moment  correlation  for  a  situation  where  measures  vary  in  two 
dimensions:   persons  and  tests  (or  items,  etc.).   Under  homogeneity  of 
variance  assumptions,  the  intraclass  correlation  provides  an  unbiased 
estimate  of  reliability.   But  if  variances  are  heterogeneous,  the 
estimates  from  the  intraclass  r  are  negatively  biased.   That  is,  they 
represent a lower bound. Pilliner suggested extensions of the two
dimensional framework where components of variance are most needed.
His  illustration  was  a  three  dimensional  design  using  Finlayson's  data, 
for  which  his  procedures  were  derived. 

In  the  United  States,  Hoyt  (1941)  used  the  analysis  of  variance 
approach in determining the internal consistency of a test from a
subject-by-item design, where the items are dichotomously scored. He
arrived at reliability formulas identical to those derived by Kuder and
Richardson (1937)*.

Ebel (1951) made a case for the use of the intraclass correlation
for  situations  where  the  parallelism  assumption  was  impractical  due  to 
the  inequality  of  means.   He  was  interested  in  the  reliability  of 
ratings  which  he  estimated  by  applying  the  analysis  of  variance  to  a 
subjects-by-ratings design. The results from this approach were
compared  to  two  other  formulas  proposed  for  estimating  such  reliability: 
the  generalized  reliability  and  the  average  intercorrelation.   Ebel 
concluded  that  the  intraclass  formula  was  preferable  because  of  its 
flexibility  with  respect  to  the  inclusion  of  the  between  raters  variance 
in  the  error  term.   In  situations  where  the  same  raters  are  used  to  rate 
all  subjects,  the  between  raters  variance  does  not  enter  into  the  error. 
On  the  other  hand,  when  different  raters  are  used,  then  that  variance 
should  be  considered  as  error. 

In  his  1953  textbook,  Lindquist  provided  a  clear  and  comprehensive 
treatment of the use of variance components in the estimation of
reliability. He discussed the possibility of obtaining negative estimates
particularly  when  the  number  of  degrees  of  freedom  is  small  for  some 
factors.   A  small  number  of  degrees  of  freedom  may  not  be  crucial  for 
variance  components  which  are  not  of  interest  (such  as  the  between  raters 
variance  discussed  by  Ebel  in  situations  where  all  raters  rate  all 
subjects).   Lindquist  also  demonstrated  that  increasing  the  number  of 
observations  in  a  study  resulted  in  different  effects,  depending  on  the 
levels of the factors sampled. In this regard, the Spearman-Brown
formula has limited utility. The limitations of the Spearman-Brown
formula for showing the effects on reliability from an increase in the
levels of a factor had been previously discussed by others (e.g.,
Pilliner). Finally, Lindquist illustrated the added utility of
estimating variance components for determining the relative importance
of  the  various  sources  of  error.   This  information  could  be  useful  in 
suggesting  designs  for  the  construction  of  measurement  schedules.   The 
idea  of  using  variance  component  estimates  for  deciding  among  different 
designs  was  later  expanded  by  Vaughn  and  Corballis  (1969). 
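
The limitation of the Spearman-Brown formula noted above can be stated compactly. The notation in the block below is an assumption introduced for illustration: $\rho_1$ is the reliability of a single observation, $k$ the lengthening factor, $\sigma^2_p$ the person variance, and $\sigma^2_1$, $\sigma^2_2$ two separate error components sampled $n_1$ and $n_2$ times. The two-component expression is a simplified sketch rather than a formula taken from Lindquist.

```latex
% Spearman-Brown: reliability of a measure lengthened k times
\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}

% Equivalent form with a single undifferentiated error component
\rho_1 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_e},
\qquad
\rho_k = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_e / k}

% With two error sources, each component is divided only by the
% number of levels sampled from its own factor
\rho = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_1 / n_1 + \sigma^2_2 / n_2}
```

Under the last expression, doubling $n_1$ leaves $\sigma^2_2 / n_2$ untouched, so the gain in reliability depends on which factor is lengthened. A single Spearman-Brown projection cannot represent this.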

Using  the  analysis  of  variance  approach  and  extending  the  designs 
used  to  estimate  reliability  to  more  than  two  dimensions  implied  a 
conceptualization  of  reliability  as  a  characteristic  of  a  measurement 
procedure  rather  than  a  measurement  instrument.   This  was  the  position 
taken  by  Rajaratnam  (1960)  and,  more  recently,  discussed  by  Rowley  (1976) 
in  the  context  of  observational  measures. 

In  his  article,  Rajaratnam  introduced  the  notion  of  a  reliability 
coefficient  as  the  ratio  of  true  score  variance  to  the  observed  score 
variance  expected  in  a  set  of  observations  obtained  by  using  the  same 
measurement  procedure  in  a  specific  way.   He  formulated  coefficients  for 
situations where every rater does not rate every subject. In this
situation, as Ebel had suggested, the systematic variance of raters is
part of the error term since it enters into the expected observed score
variance. Rajaratnam also introduced the distinction between G and D
studies which was discussed in the introduction.

In  studying  the  reliability  of  classroom  observational  schedules, 
Medley  and  Mitzel  (1963)  made  use  of  the  analysis  of  variance  approach 
in reliability estimation. Their application is extended to a four-way
factorial without replications. These authors illustrate the vast
amount of reliability information which may be obtained from one
carefully designed study using analysis of variance methods.

Several articles published by Cronbach, Gleser, and Rajaratnam
(Cronbach et al., 1963; Gleser et al., 1965; Rajaratnam et al., 1965),
culminating in the publication of their 1972 book, have summarized
the  conceptualization  of  reliability  estimation  from  the  analysis  of 
variance.   These  authors  presented  a  general  framework  which  encompasses 
the  classical  model  and  may  be  extended  to  include  experimental  designs 
for  fixed,  random,  and  mixed  models.   They  rely  heavily  on  the  paper  by 
Cornfield  and  Tukey  (1956)  dealing  with  variance  component  estimation 
for factorials through the use of expected mean squares. Their
treatment is limited to balanced designs, having equal numbers of
observations in the subclassifications.

In  the  introductory  chapter,  certain  problems  associated  with  the 
classical  theory  were  presented.   One  of  these  problems  was  discussed 
by  Guttman  (1953)  in  his  critique  of  Gulliksen's  (1950)  book.   Guttman 
observed  that  the  notion  of  parallel  tests,  the  heart  of  classical 
reliability  theory,  does  not  provide  a  unique  definition  of  reliability, 
since  there  may  be  more  than  one  reasonable  basis  for  forming  parallel 
tests. 

In  their  work,  Cronbach  et  al.  (1963)  reformulated  the  theory  of 
reliability  to  overcome  the  inadequacies  presented  by  the  parallelism 
assumption.   They  rephrased  the  reliability  issue  as  follows:   "an 
investigator  asks  about  the  precision  or  reliability  of  a  measure  because 
he  wishes  to  generalize  from  the  observation  in  hand  to  some  class  of 
observations to which it belongs" (p. 144). Their theory requires that
the investigator clearly specify a universe of conditions of observation
over which generalization is to be made. The problem of reliability
thus becomes one of generalizability.

In terms of generalizability theory, a person's universe score
(analogous to the classical true score) is defined as the expected
score  over  all  admissible  observations.   This  definition  is  equivalent 
to  Lord  and  Novick's  (1968)  "generic  true  score."   The  obtained  score 
is  a  sample  from  a  universe  of  admissible  observations  and  will  generally 
differ  from  the  universe  score. 

A  model  is  constructed  where  the  observed  score  is  expressed  in 
terms  of  the  hypothesized  effects.   For  example,  consider  the  model 

(7)    $X_{pj} = \pi_p + \alpha_j + e_{pj}$

The observed score, $X_{pj}$, given to person p by judge j is the sum of
three components, namely $\pi_p$, the effect for person p; $\alpha_j$, the
bias of judge j; and an error component $e_{pj}$ which may, for example,
represent some idiosyncratic reaction of judge j to a particular person p.
These components are assumed to be independent. Models like (7) can be
constructed to fit any particular design.

The variation found among observed scores, $\sigma^2(X)$, may be
partitioned into variance components

(8)    $\sigma^2(X) = \sigma^2(\pi) + \sigma^2(\alpha) + \sigma^2(e)$.

$\sigma^2(\pi)$ represents the variation due to persons and, in Cronbach's
terms, the universe score variance.

Cronbach et al. (1972) make a distinction between two error components,
$\sigma^2(\Delta)$ and $\sigma^2(\delta)$ (this distinction was previously
noted by Ebel (1951)). To illustrate the distinction assume that every
judge considered every person. The component $\sigma^2(\delta)$, estimated
from $\sigma^2(e)$ in our model, refers to the variance of each person's
observed deviation scores under each judge, $(X_{pj} - \bar{X}_j)$, around
the universe deviation score $(\mu_p - \mu)$. These deviation scores
eliminate the systematic variance among judges, $\sigma^2(\alpha)$, since
the mean for each judge is subtracted from the raw score to obtain the
deviation score. In general, the systematic variance of facets where the
same conditions are sampled for every person is excluded from
$\sigma^2(\delta)$. The error variance $\sigma^2(\Delta)$ refers to the
variance of each person's observed scores, $X_{pj}$, around their universe
score, $\mu_p$. In our example, $\sigma^2(\Delta) = \sigma^2(\alpha) +
\sigma^2(e)$. In the classical sense, this variance component is the only
component of error. The square root of $\sigma^2(\Delta)$ is the standard
error of measurement. It will be noted that $\sigma^2(\Delta)$ will
generally be greater than $\sigma^2(\delta)$.

The  emphasis  of  generalizability  theory  is  on  the  estimation  of  the 
variance  components.   These  variance  components  have  several  uses,  one 
of which is the estimation of generalizability coefficients via
intraclass correlations. The coefficient of generalizability is defined as
the ratio of the universe score variance to the expected observed score
variance. It is approximately the expected value of the squared
correlation of observed score and universe score, $E\rho^2(X, \mu)$. The
intraclass correlation is a good approximation of $\rho^2(X, \mu)$ if homogeneity
of  variance  assumptions  are  met.   Maxwell  and  Pilliner  (1968)  and 
Selvage  (1976)  have  recommended  performing  transformations  on  the  data 
to  achieve  stability  of  variances  when  the  assumptions  are  not  met. 
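
The quantities defined for the persons-by-judges model can be computed from a small table of scores. The sketch below assumes a balanced, fully crossed design with one score per cell and all effects random. The function name `crossed_components`, the dictionary keys, and the data are hypothetical. It solves the expected mean squares for the three components of the model and then forms the two error variances and the generalizability coefficient for a score based on a single judge.

```python
def crossed_components(scores):
    """Variance components for a persons x judges table, one score per cell.

    scores[p][j] is the score judge j gave person p (balanced, fully crossed).
    Returns a dict of estimated components and derived quantities.
    """
    n_p, n_j = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_j)
    person_means = [sum(row) / n_j for row in scores]
    judge_means = [sum(row[j] for row in scores) / n_p for j in range(n_j)]

    ms_p = n_j * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_j = n_p * sum((m - grand) ** 2 for m in judge_means) / (n_j - 1)
    ss_res = sum(
        (scores[p][j] - person_means[p] - judge_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_j)
    )
    ms_res = ss_res / ((n_p - 1) * (n_j - 1))

    var_e = ms_res                        # E(MS residual) = var_e
    var_pi = (ms_p - ms_res) / n_j        # E(MS persons) = var_e + n_j * var_pi
    var_alpha = (ms_j - ms_res) / n_p     # E(MS judges) = var_e + n_p * var_alpha
    var_delta = var_e                     # error for comparative decisions
    var_Delta = var_alpha + var_e         # error for absolute decisions
    return {
        "var_pi": var_pi,
        "var_alpha": var_alpha,
        "var_e": var_e,
        "var_delta": var_delta,
        "var_Delta": var_Delta,
        "g_coefficient": var_pi / (var_pi + var_delta),
    }

# Hypothetical ratings: three persons (rows) scored by two judges (columns).
est = crossed_components([[2.0, 4.0], [4.0, 8.0], [6.0, 6.0]])
print(est)
```

The coefficient returned here treats the observed score as a single judge's rating; a score averaged over several judges would have smaller error variances.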

The variance components are also used in planning designs for
D-studies. When making absolute decisions, it is desirable to reduce the
error $\sigma^2(\Delta)$. According to Cronbach et al. (1972) a nested design
reduces $\sigma^2(\Delta)$ more than a crossed design with the same number of
observations per person since more conditions are sampled in the nested
design. For comparative decisions $\sigma^2(\delta)$ is the appropriate error
to consider in determining the adequacy of the measurement procedure. The
magnitudes of the variance components provide an indication of the relative
contribution of the different effects to the error. This knowledge
is useful in determining the number of conditions to be sampled from
each facet in subsequent D studies in order to maintain the error at
a specified level.
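
A minimal sketch of this planning step follows, for the persons-by-judges model in which each person's score is the mean over the same set of judges (a crossed D study). Under that assumption the error for absolute decisions is $[\sigma^2(\alpha) + \sigma^2(e)]/n$ and the error for comparative decisions is $\sigma^2(e)/n$, where $n$ is the number of judges. The function name `judges_needed`, its parameters, and the component values are hypothetical.

```python
def judges_needed(var_alpha, var_e, target, absolute=True, max_judges=1000):
    """Smallest number of judges keeping the error variance at or below target.

    var_alpha: judge (systematic) variance component.
    var_e: residual variance component.
    absolute: True uses the error for absolute decisions, False the error
              for comparative decisions.
    """
    for n in range(1, max_judges + 1):
        error = (var_alpha + var_e) / n if absolute else var_e / n
        if error <= target:
            return n
    raise ValueError("target not reachable within max_judges")

# Hypothetical component estimates from a G study.
n_absolute = judges_needed(var_alpha=1.0, var_e=2.0, target=0.5, absolute=True)
n_relative = judges_needed(var_alpha=1.0, var_e=2.0, target=0.5, absolute=False)
print(n_absolute, n_relative)
```

With these values, more judges are required for absolute decisions than for comparative ones, because the judge component counts as error only in the absolute case.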

Cronbach et al. (1972) consider a third type of error, $\sigma(\epsilon)$, the
error of estimate. It is the square root of the familiar variance for
errors of estimate in linear regression. The regression equation they
consider is that for predicting $\mu_p$ from the observed score and group
information. According to Cronbach et al. (1972, p. 15), the universe
score is "the ideal datum on which to base . . . decision[s]." They
recommend estimating universe scores through linear regression and
setting confidence intervals around the estimated true score using
$\sigma(\Delta)$. The estimated universe scores are not very useful if all
scores are regressed to the population mean, since they will be perfectly
correlated with the observed scores. But if subpopulations of persons exist with
different  means,  the  universe  score  may  be  predicted  from  the  observed 
score  and  the  subpopulation  information. 

In  their  book,  Cronbach  et  al.  provide  detailed  examples  of  the 
application  of  generalizability  theory  to  simple  experimental  designs 
involving  both  crossed  and  nested  facets.   They  also  extended  the  theory 
to  encompass  multivariate  problems. 

Since  the  publication  of  Cronbach's  book  several  authors  have 
applied  the  principles  of  generalizability  theory  to  various  situations. 
Levy  (1974)  applied  the  theory  to  studies  of  reliability  in  clinical 
settings; and Gillmore, Kane, and Naccarato (1978) to student ratings
of instruction.

In the spirit of generality, Mellenbergh (1977) has recently
proposed a more extended view of reliability by considering all possible
replications of the design where, in addition to replications of facets,
replications of subjects for fixed facets are also possible. He
suggested using replicability coefficients which are defined as the
correlation  between  two  replications  of  the  design.   His  coefficients 
include  generalizability  coefficients  and  also  make  use  of  estimates 
of  the  variance  components.   Several  of  the  possible  coefficients, 
however,  serve  no  interesting  purpose  in  most  practical  situations. 

Brennan  (1975)  extended  the  idea  of  calculating  reliability  from 
a  person-by-item  analysis  of  variance  to  a  situation  where  persons  are 
nested  within  some  higher  order  dimension.   Assuming  an  equal  number  of 
persons  in  each  class,  Brennan  compared  the  generalizability  coefficients 
derived  from  a  split-plot  factorial  design  with  students  nested  within 
classes and crossed with items to those derived when the nesting
classification (i.e., classes) is ignored (a randomized blocks design). He
concluded that "the experimental model used to collect data for most
reliability studies is usually one where students are nested within
some dimension; therefore, the split-plot design would appear to be
more appropriate than a simple randomized block design" (p. 780). In
addition,  the  split-plot  design  can  be  used  to  provide  a  basis  for 
estimating  the  reliability  of  scores  for  the  units  within  which  persons 
are  nested. 

For his design, Brennan stated that the reliability of the test of
specified length calculated from the split-plot design would be less
than, equal to, or greater than that calculated from a randomized block
design depending upon whether the ratio of $\sigma^2(p)$ (the person
variance component) to $\sigma^2(e)$ (the error variance) is less than, equal
to, or greater than the ratio of $\sigma^2(s)$ (the school variance component)
to $\sigma^2(si)$ (the school by item variance component).

Thus,  if  one  uses  a  randomized  block  design  to 
calculate  reliability  for  persons  when,  in  fact, 
persons  are  nested  within  some  dimension,  such 
as schools or classrooms, the resulting coefficient
will be biased, and, moreover, the
direction of bias will be unknown. (p. 785)

Kane  and  Brennan  (1977)  extended  generalizability  theory  to  a  split- 
plot  design  in  which  students  were  nested  within  classes  and  crossed 
with  items.   Their  purpose  was  to  estimate  the  generalizability  of  a 
class  mean,  where  the  class  was  the  unit  of  analysis.   They  assumed  an 
equal  number  of  students  in  each  class.   Four  different  coefficients 
were  formulated  corresponding  to  four  universes:   an  infinite  universe 
of  students  and  items,  a  fixed  universe  of  students  and  items,  a 
universe  with  fixed  students  and  infinite  items,  and  a  universe  with 
infinite  students  and  fixed  items. 

The  situation  where  the  students  are  fixed  is  somewhat  artificial 
since,  in  educational  research,  it  is  generally  inappropriate  to 
restrict the universe of generalization for the student facet.
Restricting the set of both items and students is very unlikely. The
universe score variance in this case is estimable if the interaction
effect for students and items and the error in the model are not
confounded, that is, if there is more than one replication of each
class-student-item
observation  or  if  the  student-item  interaction  is  assumed  to  be  zero 
and  its  estimate  taken  as  the  error  estimate. 

In a subsequent section, the authors showed how certain coefficients
may be estimated from mixed models. However, since the components
from a model with a fixed facet cannot be used to estimate a
generalizability coefficient that assumes generalization over that facet,
the authors recommended a random model in the estimation of variance
components.

Kane  and  Brennan  also  related  three  coefficients,  which  appear  in 
the  literature  for  estimating  the  reliability  of  class  means,  to  their 
four generalizability coefficients. None of the three reliability
coefficients is equivalent to their generalizability coefficient where
generalization  is  intended  over  students  and  items,  a  very  common 
situation. 

Generalizability  theory  offers  innumerable  possibilities  for  well 
designed  studies  to  be  conducted  as  part  of  instrument  development. 
Much  information  may  be  gained  from  one  G-study,  some  of  which  is 
unattainable  under  the  classical  approach.   As  the  principles  are 
applied  to  various  measurement  problems,  their  strengths  and  limitations 
will become apparent. More applications are needed in all areas. To this
author's  knowledge  the  theory  has  not  been  applied  to  the  assessment  of 
writing  ability.   The  studies  by  Finlayson  (1951)  and  Vernon  and  Millican 
(1954)  approximate  this  effort.   However,  these  studies  only  reported 
tests  of  hypotheses  and  interclass  correlation  coefficients  and  did  not 
use  estimates  of  variance  components.   This  applied  study  extended  the 
design  used  by  Finlayson  and  incorporated  a  method  of  estimating  variance 
components  for  unbalanced  data. 

Variance  Component  Estimation 
Thus far, all references to generalizability theory, both theoretical
and applied, have assumed balanced designs. For balanced designs the
analysis of variance method of estimation is universally accepted. The
expected values of the mean squares may be expressed as linear
combinations of the variance components. The coefficients of the components
are easy to obtain by rules developed by Cornfield and Tukey (1956) for
fixed, random, and mixed models. These rules appear in standard texts
such as Kirk (1968) and Winer (1971). The best method of estimation is
to equate the observed mean squares from the analysis of variance under
fixed effects to the linear combination of variance components. Then
the resulting set of simultaneous equations is solved for the variance
components. These estimates are minimum variance and are unbiased
(Searle, 1971b).
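
As a worked instance of this procedure, consider the balanced persons-by-judges model with one observation per cell and all effects random. The mean square labels $MS_p$, $MS_j$, and $MS_{res}$ and the sample sizes $n_p$ and $n_j$ are notation assumed here for illustration.

```latex
% Expected mean squares for the balanced p x j random model
E(MS_p)     = \sigma^2(e) + n_j\,\sigma^2(\pi)
E(MS_j)     = \sigma^2(e) + n_p\,\sigma^2(\alpha)
E(MS_{res}) = \sigma^2(e)

% Equating observed mean squares to their expectations and solving
\hat{\sigma}^2(e)      = MS_{res}
\hat{\sigma}^2(\pi)    = (MS_p - MS_{res}) / n_j
\hat{\sigma}^2(\alpha) = (MS_j - MS_{res}) / n_p
```

Because each estimate is a difference of mean squares, it can turn out negative in a given sample even though the component it estimates cannot be.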

Most  methods  of  estimating  variance  components  involve  some 
quadratic  form  of  the  observations.   The  mean  squares  from  the  analysis 
of  variance  are  the  appropriate  quadratics  to  use  when  the  design  is 
balanced.   Estimating  variance  components  from  unbalanced  data  is  more 
complex  because  there  is  no  universally  accepted  method.   According 
to Searle (1971b, p. 33), "no particular set of quadratics has been
established  as  being  more  optimal  than  any  other  set."   For  unbalanced 
designs,  using  the  analysis  of  variance  procedure  leads  to  the  question 
of  which  mean  squares  to  use,  since  with  unbalanced  data  the  mean  squares 
may  be  unadjusted  or  adjusted  for  one  or  more  effects. 

A  comprehensive  review  of  methods  of  estimation  based  on  the 
analysis  of  variance  has  been  given  by  Searle  (1971a,  1971b)  for  both 
balanced  and  unbalanced  designs.   For  the  latter  case,  Searle  discussed 
three  methods  proposed  by  Henderson  (1953).   Henderson's  method  1  consists 
of  equating  the  unadjusted  sums  of  squares  from  the  fixed  effects 
analysis  of  variance  to  their  expectations  obtained  under  a  random 
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effects  model.  These  expectations  are  linear  combinations  of  the 
variance  components.   Thus,  solving  for  the  set  of  simultaneous  equations 
will  yield  estimates  of  the  components.   This  method  produces  unbiased 
estimates  except  for  the  random  effects  in  mixed  models. 
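
As an illustration of method 1, the following sketch treats the simplest unbalanced case, a one-way random model with unequal class sizes. The expectations of the unadjusted sums of squares used here are the standard ones for that model; the function name and the list-of-lists data layout are assumptions made for the example.

```python
def henderson1_one_way(groups):
    """Henderson's method 1 for a one-way random model.

    groups: list of lists of observations, one inner list per class.
    Returns (var_class, var_error).
    """
    a = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    grand_total = sum(sum(g) for g in groups)

    # Unadjusted sums of squares from the fixed effects analysis.
    t_mean = grand_total ** 2 / n_total
    t_groups = sum(sum(g) ** 2 / len(g) for g in groups)
    t_obs = sum(y ** 2 for g in groups for y in g)
    ss_between = t_groups - t_mean
    ss_within = t_obs - t_groups

    # Expectations under the random effects model:
    #   E(SS_within)  = (N - a) * var_error
    #   E(SS_between) = (N - sum(n_i^2) / N) * var_class + (a - 1) * var_error
    var_error = ss_within / (n_total - a)
    k = n_total - sum(n ** 2 for n in sizes) / n_total
    var_class = (ss_between - (a - 1) * var_error) / k
    return var_class, var_error
```

With equal class sizes the coefficient k reduces to n(a - 1), and the estimates coincide with those obtained from the balanced analysis of variance.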

Henderson's method 2 was developed to correct this deficiency of
method 1 with mixed models. The procedure of the second method is to
"correct"  the  data  by  some  previous  least  squares  estimates  of  the  fixed 
effects.   Using  the  "corrected"  data  in  place  of  the  original  data, 
method  2  proceeds  as  method  1.   This  method  is  inappropriate  when  there 
are  interactions  between  the  fixed  and  random  effects. 

The method of fitting constants, or Henderson's method 3, uses
sums of squares that are adjusted sequentially and otherwise follows the
same pattern as the other methods. The adjusted sums of squares are
similar to those of Overall and Spiegel's (1969) "a priori ordering." All
expectations of these adjusted sums of squares are taken under the full
model. Under this condition, the expected value of any term involves
all of the variance components except those for the terms for which
this term was adjusted.

As Searle pointed out, the coefficients of the variance components
for these methods are not as easy to obtain as those with balanced data.
He cited several references which discuss numerical methods for obtaining
the coefficients.

More recently, Rao (1971, 1972) proposed a different approach
to the estimation of variance components. His methods, called MINQUE
(minimum norm quadratic unbiased estimation) and MIVQUE (minimum variance
quadratic unbiased estimation), provide a general approach which is
applicable to both balanced and unbalanced designs and suitable for
either random or mixed models.

To summarize them, let us consider the model

(9)  $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{U}_1\boldsymbol{\xi}_1 + \mathbf{U}_2\boldsymbol{\xi}_2 + \cdots + \mathbf{U}_k\boldsymbol{\xi}_k$

where $\mathbf{Y}$ is the $n \times 1$ vector of observations, $\mathbf{X}$ is an
$n \times m$ design matrix for the fixed effects (in a random effects model
$\mathbf{X}$ is just a column vector of 1's), $\boldsymbol{\beta}$ is a vector of
unknown parameters (the grand mean in a random effects model),
$\mathbf{U}_i$ is a given $n \times c_i$ matrix, the columns of which are the
coded variables for a particular factor, and $\boldsymbol{\xi}_i$ is a
$c_i$-vector of uncorrelated variables for the ith random effects factor in
the model (which may be a main effects factor or an interaction factor).
The $\boldsymbol{\xi}_i$'s have zero mean and variance matrix
$\sigma_i^2 \mathbf{I}_{c_i}$, $i = 1, \ldots, k$, where the $\sigma_i^2$ are
unknown. Furthermore, $\boldsymbol{\xi}_i$ and $\boldsymbol{\xi}_j$ ($i \neq j$)
are uncorrelated. The kth factor is the error term. Then

(10)  $E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}$,

(11)  $\mathbf{V}^{*} = \operatorname{Var}(\mathbf{Y}) = \sigma_1^2 \mathbf{V}_1 + \cdots + \sigma_k^2 \mathbf{V}_k$, where

(12)  $\mathbf{V}_i = \mathbf{U}_i \mathbf{U}_i'$.

Rao defined

(13)  $\mathbf{V} = \sum_{i=1}^{k} \mathbf{V}_i$.

The problem then is to estimate the variance components
$\sigma_1^2, \ldots, \sigma_k^2$.

Rao considered the estimation of a linear function

(14)  $p_1 \sigma_1^2 + \cdots + p_k \sigma_k^2$

of the variance components from a quadratic function
$\mathbf{Y}'\mathbf{A}\mathbf{Y}$ of the observations.

$\mathbf{A}$ is symmetric and is chosen to satisfy the following conditions:

(a)  $\mathbf{A}\mathbf{X} = \mathbf{0}$

(b)  $E(\mathbf{Y}'\mathbf{A}\mathbf{Y}) = p_1 \sigma_1^2 + \cdots + p_k \sigma_k^2$.

Condition (a) is necessary for the estimator to be invariant to changes
in $\boldsymbol{\beta}$ (Rao, 1972). For condition (b) to be true (i.e., for
the estimator to be unbiased), $\operatorname{tr} \mathbf{A}\mathbf{V}_i = p_i$,
$i = 1, \ldots, k$, where tr represents the trace of a matrix (the sum of
the diagonal elements). To obtain the MINQUE estimator, the Euclidean norm
$\operatorname{tr}[(\mathbf{V}^{*}\mathbf{A})^2]$ is minimized. This requires
some a priori knowledge of the ratios of the $\sigma_i^2$. To obtain the
MIVQUE estimator, the variance of $\mathbf{Y}'\mathbf{A}\mathbf{Y}$ is
minimized for a particular choice of $\sigma_1^2, \ldots, \sigma_k^2$. That
variance is
$\operatorname{Var}(\mathbf{Y}'\mathbf{A}\mathbf{Y}) = 2 \operatorname{tr}[(\mathbf{V}^{*}\mathbf{A})^2]$
plus a term in $\mathbf{A}$ and kurtosis parameters. Under normality
assumptions, the kurtosis parameters are zero and MINQUE equals MIVQUE.
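
The computation implied by conditions (a) and (b) can be sketched numerically. The code below solves the usual MINQUE equations, in which the matrix of traces tr(R V_i R V_j) multiplies the vector of components and is set equal to the quadratic forms Y'R V_i R Y, with R chosen so that R X = 0. The function name, the argument names, and the `prior` weights are assumptions made for this illustration; it is not the program used in this study.

```python
import numpy as np

def minque(y, X, U_list, prior=None):
    """Solve the MINQUE equations for y = X b + sum_i U_i xi_i.

    U_list holds the design matrices of the random factors; the last
    one should be the identity matrix for the error term. prior holds
    a priori weights for the variance components (all ones if omitted).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    V_parts = []
    for U in U_list:
        U = np.asarray(U, dtype=float)
        V_parts.append(U @ U.T)          # V_i = U_i U_i'
    if prior is None:
        prior = np.ones(len(V_parts))
    V = sum(w * Vi for w, Vi in zip(prior, V_parts))
    V_inv = np.linalg.inv(V)

    # R satisfies R X = 0, which makes the estimator invariant to the
    # fixed effects (condition (a) in the text).
    middle = np.linalg.pinv(X.T @ V_inv @ X)
    R = V_inv - V_inv @ X @ middle @ X.T @ V_inv

    Ry = R @ y
    S = np.array([[np.trace(R @ Vi @ R @ Vj) for Vj in V_parts]
                  for Vi in V_parts])
    q = np.array([Ry @ Vi @ Ry for Vi in V_parts])
    return np.linalg.solve(S, q)
```

For balanced data the solution does not depend on the prior weights and agrees with the analysis of variance estimates, which offers a convenient check on any implementation.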

Rao's  methods  are  preferable  to  those  proposed  by  Henderson  for 
three  reasons.   First,  they  have  a  wider  range  of  applicability  since 
they  can  accommodate  mixed  as  well  as  random  models.   Second,  the 
computations  involved  are  more  efficiently  programmable.   Third,  when 
prior  estimates  of  the  components  are  available,  the  MIVQUE  method 
provides  estimates  which  are  locally  minimum  variance. 

The  second  reason  is  relevant  to  G-studies  because  the  designs 
used  in  such  studies  tend  to  be  large.   As  was  mentioned  previously,  a 
G-study  should  include  as  many  sources  of  error  variance  related  to  a 
measurement  procedure  as  possible.   For  each  facet  included,  the 
maximum  number  of  conditions  possible  should  be  sampled.   The  resulting 
design  then  requires  the  most  efficient  method  for  its  analysis.   Rao's 
methods  satisfy  this  criterion. 

Summary 

The  literature  pertinent  to  the  measurement  of  writing  ability 
indicates  that  essay  tests  represent  the  most  valid  method  of  assessment. 
Several sources of error have been identified as affecting the
reliability of this test form. Although variability among raters is the
source  most  commonly  examined,  day-to-day  and  assignment  variability  are 
considered  to  be  equally  or  more  important.   Missing  from  the  literature 
are  empirical  studies  which  examine  how  these  sources  affect  the 
reliability  of  the  essay. 

Generalizability theory offers a conceptual framework which is
applicable to the study of multiple sources of error variation. Based
on Fisher's work on the analysis of variance, the theory treats the
problem of reliability as one of generalization from one observation
to a universe of admissible observations.

In  a  generalizability  study,  the  observations  are  gathered  under 
a  specific  design  characterized  by  facets,  the  identified  sources  of 
error.   The  conditions  of  each  facet  included  in  the  design  may  be 
fixed  or  sampled  from  the  total  universe  of  conditions.   The  relative 
magnitude  of  the  sources  of  error  variation  is  determined  through  the 
estimation  of  variance  components.   For  purposes  of  simplification  in 
the  estimation  process,  generalizability  theory  has  been  restricted  to 
balanced  designs.   The  literature  on  generalizability  theory  is  lacking 
in  applied  studies,  although  content  areas  such  as  writing  could  greatly 
profit  from  its  application. 

Also  missing  from  the  psychometric  literature  are  extensions  of 
the  theory  to  unbalanced  designs.   These  extensions  are  much  needed 
since  these  designs  are  typical  in  educational  research.   Psychometricians 
could  profit  from  methods  of  estimating  variance  components  documented 
in the statistical literature. In particular, the methods of Henderson
(1953) and Rao (1971, 1972) are applicable to unbalanced data. These
methods  allow  the  principles  of  generalizability  theory  to  be  further 
extended. 


CHAPTER  III 
METHOD 

This study was designed to demonstrate the application of
generalizability theory to the assessment of writing ability. The data
were collected in a natural setting on a sample of fourth grade children.
The study extended the application of the theory to a situation where
unequal but proportional numbers of subjects appeared in the
subclassifications.

The  sample,  the  facets,  the  design,  and  the  procedures  for  data 
collection  and  analysis  are  described  in  this  chapter. 

The  Sample 

The sample used in this study consisted of 104 fourth grade
students from eight classes in two schools in Alachua County: four from
P. K. Yonge Laboratory School and four from Alachua Elementary School.
The  data  used  in  this  study  were  collected  as  part  of  a  research  project 
on  creative  writing  conducted  at  those  schools. 

P. K. Yonge is a laboratory school associated with the College of
Education at the University of Florida. The student population at
each grade level is selected from a waiting list in such a way as to
approximate, in each classroom, an equal balance between males and
females; a 20:80 racial balance between black students and white or
other students, respectively; and an equal balance from each of five
income categories. Fourth and fifth grades are combined in the
classrooms at this school. The four classrooms participating in this
study exhausted those classrooms containing fourth grade students. A
total of 59 fourth grade students are currently enrolled at P. K. Yonge.
However, only the 37 who had complete data were used in this study.

Alachua  Elementary  is  a  public  school  in  the  rural  town  of  Alachua. 
The  four  classrooms  from  this  school  also  exhausted  the  fourth  grade 
population.   In  this  school,  students  at  each  grade  level  are  assigned 
to  classrooms  to  maintain  the  sex  and  race  balance  previously  described. 
A  total  of  67  students  in  this  school  had  complete  data  out  of  an  initial 
sample  of  113.   Thus,  the  writing  samples  used  in  this  study  were 
obtained  from  a  total  of  104  individuals.   The  sample  sizes  for  each 
class are shown in Table 1, broken down by sex and race.

The Writing Samples: Data Collection

Samples  of  compositional  writing,  in  two  different  writing  modes, 
were  collected  on  three  occasions.   On  each  occasion,  verbal  and 
written  instructions  were  given  to  the  children  by  one  of  the  staff 
members of the project. The same person collected the samples throughout
the occasions at each school. Steps were taken to ensure that the
children  understood  the  task.   Furthermore,  praise  was  used  in  an 
attempt  to  motivate  the  children  to  write.   On  each  occasion,  the 
assignment  and  the  instructions  were  standard  for  all  students.   Each 
student  was  allowed  sufficient  time  to  complete  the  task.   On  the  average, 
the  compositions  were  completed  in  approximately  45  minutes. 

The  Facets 

The writing samples were characterized by two facets: modes and
occasions. A third facet, raters, was introduced in scoring the samples.
The levels of these facets which were used in this study are described next.

TABLE 1
SAMPLE SIZES BY CLASSROOM, SEX, AND RACE

                   Sex                 Race
Classroom     Male   Female      Black   White     Total

    1           3       5           1       7         8
    2           4       4           2       6         8
    3           8       3           2       9        11
    4           3       7           3       7        10
    5           6      11           4      13        17
    6           8      11           7      12        19
    7           6       5           2       9        11
    8           8      12           4      16        20

Note: Classrooms 1 through 4 are from P. K. Yonge and
5 through 8 are from Alachua Elementary.

Modes

The  mode  facet,  as  conceptualized  in  this  study,  was  characterized 
by  two  dimensions:   the  purpose  of  the  writing  sample  and  the  type  of 
assignment.   This  use  of  the  word  mode  is  broader  than  the  traditional 
use. Generally, four basic writing modes are mentioned in the literature
related to factors which influence children's writing ability.
These  are:   narrative,  declarative,  argumentative,  and  expository. 
Each  of  these  modes  constitutes  a  different  purpose.   For  example, 
the  purpose  of  writing  in  the  narrative  mode  is  to  tell  a  story;  that 
of  the  argumentative  mode  is  to  convince  the  audience.   For  each  one 
of  these  purposes,  different  types  of  assignments  are  possible.   A 
child  who  is  asked  to  write  in  the  narrative  mode  may  tell  his/her 
story  through  a  poem,  a  letter,  a  report,  etc.   Characterizing  the 
type  of  writing  along  these  two  dimensions  allows  for  a  large  number  of 
possible  conditions  on  this  facet.   In  this  study,  generalization  was 
intended  to  all  of  the  possible  conditions  thus  identified. 

Two  types  of  writing  assignment  were  used  in  this  study,  each 
representing  a  different  writing  purpose.   In  one  mode,  children  were 
instructed  to  prepare  a  brief  report  about  specific  animals  using  a 
standard  set  of  facts  supplied  by  the  investigator.   The  facts  were 
presented  either  in  written  form  or  with  the  aid  of  a  film.   On  the 
first  occasion  a  list  of  facts  about  bats  was  provided  for  the  children. 
Films  about  cows  and  pigs  provided  the  facts  used  on  the  second  and 
third  occasions,  respectively.   After  the  presentation  of  the  stimuli, 
the  facts  were  discussed  with  the  children. 

In the second mode, the children were asked to write a creative
story explaining some imaginary phenomenon such as "how the camel got
the hump." On each occasion, a list of titles was provided for the
children from which they were to select one.

Occasions 

The writing samples were collected on three occasions during the
1977-78 school year: Fall, Winter, and Spring. On each occasion, the
descriptive reports were collected one week before the narrative stories.
This order was maintained because the investigators felt that there
would be less carry-over from a report to a story than vice versa. The
one week time period within an occasion was allowed for two reasons:
to minimize carry-over effects and to maximize the motivation of the
children. With children at the elementary level, there is a loss in
motivation when similar tasks are assigned on the same day.

Writing  performance  is  expected  to  fluctuate  from  day  to  day. 
Furthermore,  it  is  expected  that  children's  writing  ability  will  also 
fluctuate (hopefully improve) during the year. In this study,
generalization was intended to any time during the school year.

Raters

The four raters represent a sample of the raters who could have been
used. Three of the raters were graduate students in educational research;
the fourth rater was an associate professor in the same department.
Generalization  along  this  facet  is  intended  to  any  person  who  would  rate 
a  sample  of  writing  for  the  purpose  of  making  a  decision  about  placement, 
selection,  grading,  or  for  purposes  of  comparison  in  a  research  study. 

The writing samples were collected and sorted into six mode-by-occasion
combinations. The children's names were covered and a number
was assigned and written on each sample for identification. Thus,
the anonymity of the samples was preserved. The raters scored the
samples on eight different days. Each day, the four raters scored
the samples using a general impression scoring method. At the beginning
of each scoring session, the raters reviewed the criteria to be
used in scoring. After scoring several samples, the raters compared
their scores and discussed samples which had received divergent scores.
These discussions were an attempt to increase the inter-rater
reliability. Each sample was scored independently.

Prior  to  the  first  rating  session,  the  raters  were  trained 
in  using  the  general  impression  method.   Samples  from  fifth  grade 
students  were  used  for  training.   During  training,  the  scaling  points 
were  determined  so  as  to  obtain  an  approximation  to  a  normal  distribution. 
Normality  was  not  a  consideration  during  the  actual  scoring  of  the 
samples. The general impression method of scoring used in this study
involved assigning a score of 1 through 8 on the basis of the overall
quality of the writing sample. The method involves the rapid,
impressionistic scoring of a sample. Generally, no more than two minutes
are spent on any one paper.

This  procedure  has  been  used  by  the  Educational  Testing  Service  (ETS) 
and  the  College  Entrance  Examination  Board,  and  was  also  used  in  the 
first  national  assessment  of  writing  conducted  by  the  National  Assessment 
of  Educational  Progress  (NAEP)  (Mellon,  1975).   The  ETS  research  on 
rater  reliability  in  the  1960's  revealed  that  multiple  ratings  based  on 
overall impressions were the best means of achieving inter-rater
reliability (Suhor, 1977). An additional advantage to this method is the
fact that it requires less time than any other method.

Design

A schematic representation of the design used in this study is
shown in Figure 1. The design is referred to as a split-plot factorial
design in standard texts (e.g., Kirk, 1968; Winer, 1971) with the
classes being the main plots and the students being the subplots. In
this design students are nested within the classes; that is, each student
appears in only one class. This situation is typical in the natural
setting. The nesting of students within class results in the confounding
of the student by class interaction with the student effect. As a
result, there is no way to estimate the student effect independent of
the class-by-student interaction. Similarly, any interaction term
involving the student effect is confounded with the corresponding
interaction term involving class-by-student. Since typically students
are nested within classrooms, the confounding of the effects mentioned
above does not present any problems.

FIGURE 1
SCHEMATIC REPRESENTATION OF THE
DESIGN INCLUDING CLASSES (C), STUDENTS (S),
OCCASIONS (O), MODES (M), AND RATERS (R)

The  students,  modes,  occasions,  and  raters  are  factorially  combined. 
In  terms  of  this  study,  this  factorial  combination  means  that  each 
student  was  measured  in  both  modes  on  each  occasion  and  that  each  rater 
scored  every  writing  sample.   The  crossing  of  students,  modes,  occasions, 
and  raters  allows  for  the  independent  estimation  of  each  main  effect 
and  all  interactions  involving  those  effects.  The  levels  of  all  factors 
included  in  this  study  were  considered  to  be  random  samples  of  all 
possible  levels  which  could  have  been  included.   Thus,  the  model  used 
is  a  random  effects  model. 

Let $X_{csmor}$ denote the rating received by student s in class c for
mode m, occasion o, and rater r. Then the structural model used in
this study may be represented as:

(15)
$$
\begin{aligned}
X_{csmor} = {} & \mu + \alpha_c + \pi_{s(c)} + \beta_m + \alpha\beta_{cm} + \beta\pi_{ms(c)} \\
& + \gamma_o + \alpha\gamma_{co} + \gamma\pi_{os(c)} + \theta_r + \alpha\theta_{cr} + \theta\pi_{rs(c)} \\
& + \beta\gamma_{mo} + \alpha\beta\gamma_{cmo} + \beta\gamma\pi_{mos(c)} + \beta\theta_{mr} + \alpha\beta\theta_{cmr} \\
& + \beta\theta\pi_{mrs(c)} + \gamma\theta_{or} + \alpha\gamma\theta_{cor} + \gamma\theta\pi_{ors(c)} \\
& + \beta\gamma\theta_{mor} + \alpha\beta\gamma\theta_{cmor} + \beta\gamma\theta\pi_{mors(c)}
\end{aligned}
$$

where $\mu$ = the grand mean,
$\alpha_c$ = the effect for class c ($c = 1, \ldots, n_c$),
$\pi_{s(c)}$ = the effect for student s nested within class c
($s = 1, \ldots, n_{s(c)}$),
$\beta_m$ = the effect for mode m ($m = 1, \ldots, n_m$),
$\alpha\beta_{cm}$ = the class-by-mode interaction effect,
$\beta\pi_{ms(c)}$ = the mode-by-student (nested within class c)
interaction effect,
$\gamma_o$ = the effect for occasion o ($o = 1, \ldots, n_o$),
$\alpha\gamma_{co}$ = the class-by-occasion interaction effect,
$\gamma\pi_{os(c)}$ = the occasion-by-student (nested within class c)
interaction effect,
$\theta_r$ = the effect for rater r ($r = 1, \ldots, n_r$),
$\alpha\theta_{cr}$ = the class-by-rater interaction effect,
$\theta\pi_{rs(c)}$ = the rater-by-student (nested within class c)
interaction effect,
$\beta\gamma_{mo}$ = the mode-by-occasion interaction effect,
$\alpha\beta\gamma_{cmo}$ = the class-by-mode-by-occasion interaction effect,
$\beta\gamma\pi_{mos(c)}$ = the mode-by-occasion-by-student (nested in class c)
interaction effect,
$\beta\theta_{mr}$ = the mode-by-rater interaction effect,
$\alpha\beta\theta_{cmr}$ = the class-by-mode-by-rater interaction effect,
$\beta\theta\pi_{mrs(c)}$ = the mode-by-rater-by-student (nested in class c)
interaction effect,
$\gamma\theta_{or}$ = the occasion-by-rater interaction effect,
$\alpha\gamma\theta_{cor}$ = the class-by-occasion-by-rater interaction effect,
$\gamma\theta\pi_{ors(c)}$ = the occasion-by-rater-by-student (nested in class c)
interaction effect,
$\beta\gamma\theta_{mor}$ = the mode-by-occasion-by-rater interaction effect,
$\alpha\beta\gamma\theta_{cmor}$ = the class-by-mode-by-occasion-by-rater
interaction effect, and
$\beta\gamma\theta\pi_{mors(c)}$ = the mode-by-occasion-by-rater-by-student
(nested in class c) interaction effect.
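It is assumed that each effect in the model (except for the grand
mean) is a random variable with a mean of zero and variance
$\sigma^2(\text{effect})$. The effects are assumed to be independent of each
other so that the total variance in the scores $X_{csmor}$ can be
partitioned as

(16)
$$
\begin{aligned}
\sigma^2(X) = {} & \sigma^2(\alpha) + \sigma^2(\pi) + \sigma^2(\beta) + \sigma^2(\alpha\beta) + \sigma^2(\beta\pi) + \sigma^2(\gamma) + \sigma^2(\alpha\gamma) \\
& + \sigma^2(\gamma\pi) + \sigma^2(\theta) + \sigma^2(\alpha\theta) + \sigma^2(\theta\pi) + \sigma^2(\beta\gamma) + \sigma^2(\alpha\beta\gamma) + \sigma^2(\beta\gamma\pi) \\
& + \sigma^2(\beta\theta) + \sigma^2(\alpha\beta\theta) + \sigma^2(\beta\theta\pi) + \sigma^2(\gamma\theta) + \sigma^2(\alpha\gamma\theta) + \sigma^2(\gamma\theta\pi) + \sigma^2(\beta\gamma\theta) \\
& + \sigma^2(\alpha\beta\gamma\theta) + \sigma^2(\beta\gamma\theta\pi).
\end{aligned}
$$

The variances $\sigma^2(\alpha), \ldots, \sigma^2(\beta\gamma\theta\pi)$ are
called variance components (Scheffe, 1959) and, therefore, the model is
referred to as a variance component model. To estimate variance components
it is not necessary to assume that the effects are normally distributed.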
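
The bookkeeping behind (15) and (16) can be checked with a short enumeration. The sketch below assumes only the structure stated in the text: modes, occasions, and raters are crossed with one another and with students, and students are nested within classes, so that class and student never appear in the same term. The symbol names and the function name are illustrative. The count of ratings is derived from the sample sizes reported earlier (104 students, two modes, three occasions, four raters).

```python
from itertools import combinations

FACETS = ("beta", "gamma", "theta")  # modes, occasions, raters

def split_plot_effects():
    """List the random effects of the split-plot model as tuples of symbols.

    alpha (class) and pi (student within class) never appear together,
    because students are nested within classes.
    """
    effects = []
    for size in range(len(FACETS) + 1):
        for combo in combinations(FACETS, size):
            for unit in ((), ("alpha",), ("pi",)):
                # Classes lead the label and students close it, as in the text.
                label = unit + combo if unit == ("alpha",) else combo + unit
                if label:  # the empty label is the grand mean, not an effect
                    effects.append(label)
    return effects

effects = split_plot_effects()
n_ratings = 104 * 2 * 3 * 4  # students x modes x occasions x raters
```

The enumeration yields one term for each variance component in (16), which is a useful check when the components are later entered into a computer program.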




Variance  Component  Estimation 

To estimate the variance components in (16), a new version of the
SAS VARCOMP procedure was used (Goodnight, 1978). This procedure,
called MIVQUE0, is based on the MIVQUE (minimum variance quadratic
unbiased estimation) method developed by Rao (1971). The method estimates
linear functions of the variance components through the use of quadratic
functions of the observations which have minimum variance for a
particular choice of $\sigma_1^2, \ldots, \sigma_k^2$. The VARCOMP program
selects $\sigma_1^2, \ldots, \sigma_k^2$ so as to minimize the ratio of the
variance for each effect to the residual variance. The resulting estimates
are invariant, locally best (at zero) quadratic unbiased estimates of the
variance components (Goodnight, 1978). The program used was the only one
available to handle the size of the design matrix within a reasonable
amount of computer time and space.
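
The idea behind MIVQUE0 can be sketched for a much smaller problem than the one analyzed here. The example assumes a one-way random model with unequal class sizes and takes prior values of zero for the class component and one for the residual, so that the weighting matrix is the identity and only the grand mean needs to be swept out. The function name and the data layout are assumptions; the sketch is not a reproduction of the VARCOMP program.

```python
import numpy as np

def mivque0_one_way(groups):
    """MIVQUE with prior values (0, 1) for a one-way random model.

    groups: list of lists of observations, one inner list per class.
    Returns (var_class, var_error).
    """
    y = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n = y.size

    # Indicator columns coding class membership.
    Z = np.zeros((n, len(groups)))
    row = 0
    for j, g in enumerate(groups):
        Z[row:row + len(g), j] = 1.0
        row += len(g)

    # With prior values (0, 1) the weighting matrix is the identity, so
    # the projection M simply removes the grand mean.
    M = np.eye(n) - np.ones((n, n)) / n
    V_parts = [Z @ Z.T, np.eye(n)]
    S = np.array([[np.trace(M @ Vi @ M @ Vj) for Vj in V_parts]
                  for Vi in V_parts])
    q = np.array([y @ M @ Vi @ M @ y for Vi in V_parts])
    var_class, var_error = np.linalg.solve(S, q)
    return var_class, var_error
```

Because no matrix of the size of the full covariance matrix has to be inverted under these prior values, the approach remains practical for large designs, which is consistent with the reason given above for choosing the program.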

For  balanced  split-plot  factorial  designs  the  expected  mean 
squares  are  linear  combinations  of  the  variance  components.   In  this 
case,  the  observed  mean  squares  from  the  analysis  of  variance  may  be 
used in the formulas shown in Appendix A to estimate the variance
components.

Generalizability  Coefficients 

Tests for homogeneity of variances were performed on the basis of
warnings by Cronbach et al. (1972, pp. 100-101). In their words:

Where there is crossing of persons with facet i
(or j, etc.) observed-score variances may differ
from one application of the design to the next,
and intercorrelations between pairs of independently
obtained observed scores may differ. The
intraclass correlation (our coefficient of
generalizability) truly equals the mean of $\rho^2(X, \mu_p)$
only if all observed-score variances are equal.
One must be hesitant, then, in taking the
coefficient of generalizability as representing
the parameter $\rho^2(X, \mu_p)$ for any particular
D-study with crossed conditions.

To test for violations of homogeneity assumptions for this design,
a procedure suggested by Box (1950) and recommended by Kirk (1968) was
used. The procedure involved the following:

1. testing the equality of the variance-covariance matrices
across the eight classes; and if this hypothesis was not rejected,

2. testing the equality of the diagonal elements in the pooled
variance-covariance matrix.

The first test was performed by the DISCRIM procedure in SAS (Barr
et al., 1976). The second test was done using Bartlett's test for
homogeneity of variance. It was recognized that this test is sensitive
to violations of normality assumptions. However, a visual inspection
of the frequencies within each subclassification revealed no serious
departure from normality.
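
For reference, the textbook form of Bartlett's statistic for independent groups can be sketched as follows. How the test was adapted to the diagonal elements of the pooled variance-covariance matrix is not detailed in this section, so the example should be read only as an illustration of the statistic itself. The function name is an assumption.

```python
import math

def bartlett_statistic(samples):
    """Bartlett's statistic for equality of variances across groups.

    samples: list of lists of observations. Under the null hypothesis
    and normality, the statistic is approximately chi-square with
    (number of groups - 1) degrees of freedom.
    """
    k = len(samples)
    sizes = [len(s) for s in samples]
    n_total = sum(sizes)

    # Unbiased sample variance of each group.
    variances = []
    for s in samples:
        mean = sum(s) / len(s)
        variances.append(sum((x - mean) ** 2 for x in s) / (len(s) - 1))

    # Pooled variance weighted by degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)

    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances))
    correction = 1 + (sum(1 / (n - 1) for n in sizes)
                      - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction
```

The statistic is zero when all sample variances are equal and grows as they diverge; its sensitivity to non-normality is the reason for the visual inspection mentioned above.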

Using the point estimates of the variance components, one can
derive the formulas for any desired coefficient of generalizability,
where generalization is intended to any subset of the universe of
generalization used in this study. The generalizability coefficient,
$\rho^2(X, \mu)$, is defined as the ratio of $\sigma^2(\mu)$, the universe
score variance, to $E(\sigma^2(X))$, the expected value of the observed
score variance, the expectation taken over repeated applications of this
design.

The universe of generalization determines what constitutes the
universe score variance and the expected observed score variance. The
expected observed score variance is always made up of the universe score
variance plus error variance (Cronbach's $\sigma^2(\delta)$).


For deviation scores, the expected observed score variance is

(17)
$$
E(\sigma^2(X)) = \sigma^2(\pi) + \frac{\sigma^2(\beta\pi)}{n_m} + \frac{\sigma^2(\gamma\pi)}{n_o} + \frac{\sigma^2(\theta\pi)}{n_r} + \frac{\sigma^2(\beta\gamma\pi)}{n_m n_o} + \frac{\sigma^2(\beta\theta\pi)}{n_m n_r} + \frac{\sigma^2(\gamma\theta\pi)}{n_o n_r} + \frac{\sigma^2(\beta\gamma\theta\pi)}{n_m n_o n_r}
$$

It includes all of the components of variance involving the student
effect. Other components in the model do not enter into the expected
observed score variance because they are constant for all students, and
in the formula, the students are considered in relation to the group's
universe score. Each component is divided by the number of conditions
of each facet involved in that component. Given the formula for
the expected observed score variance, the universe score variance may
be obtained by taking the limit of $E(\sigma^2(X))$ as the number of
conditions approaches infinity. This is the case when generalization is
intended to an infinite number of levels, where all terms but
$\sigma^2(\pi)$ disappear from the formula. Thus, $\sigma^2(\pi)$ is the
universe score variance. In the situation where generalization is intended
to a fixed number of conditions for a particular facet, the component
involving that facet is considered as part of the universe score variance.
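
Formula (17) and the rule for fixed facets can be combined in a short sketch. The function below assumes that a term counts toward the universe score variance whenever none of its facets is one over which generalization is intended; this is one reading of the rule stated above and may not match every entry of Table 2. The function name, the dictionary layout, and all numerical values in the usage example are hypothetical and are not estimates from this study.

```python
def g_coefficient(components, n, random_facets=("m", "o", "r")):
    """Coefficient of generalizability from student-related components.

    components: dict mapping a tuple of facet letters to the variance
        component of that facet-by-student interaction; the empty
        tuple () holds sigma^2(pi).
    n: dict giving the number of conditions of each facet.
    random_facets: facets over which generalization is intended.
    """
    universe = 0.0
    expected_observed = 0.0
    for facets, value in components.items():
        divisor = 1
        for f in facets:
            divisor *= n[f]
        term = value / divisor
        expected_observed += term
        # A term belongs to the universe score variance only if it
        # involves no facet that is being generalized over.
        if not any(f in random_facets for f in facets):
            universe += term
    return universe / expected_observed

# Hypothetical variance components, for illustration only.
example = {(): 1.0, ("m",): 0.4, ("o",): 0.3, ("r",): 0.2,
           ("m", "o"): 0.6, ("m", "r"): 0.8, ("o", "r"): 1.2,
           ("m", "o", "r"): 2.4}
sizes = {"m": 2, "o": 3, "r": 4}
rho_all = g_coefficient(example, sizes)
rho_modes_fixed = g_coefficient(example, sizes, random_facets=("o", "r"))
```

Treating modes as fixed moves the mode-by-student term from the error into the universe score variance, so the coefficient for the narrower universe is never smaller than the coefficient for generalization across all three facets.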

In  Table  2,  seven  coefficients  are  suggested  for  all  possible 
combinations  of  fixed  and  infinite  generalizations  across  the  three  facets. 
These  formulas  may  be  used  in  a  D-study  involving  a  similar  population 
of  subjects  and  a  subset  of  these  facets  by  substituting  the  values  for 
the  n's  for  that  study  and  these  estimates  of  the  variance  components. 
The denominator of the formulas is the same for all universes. The terms
have been rearranged so that the error component is within the parentheses.

The Error Variance $\sigma^2(\Delta)$


The coefficients of generalizability exclude systematic facet
components from the error term and, therefore, from the expected
observed score variance. Two situations may arise where these variance
components should be considered as part of the error component. These
are:

1. Studies where the conditions of a facet are nested within the
student, rather than crossed. In other words, a different condition
or set of conditions is sampled for each student.

2. Situations which involve determining confidence intervals
around an individual's score for the purpose of making an absolute
decision.

The formulas for estimating $\sigma^2(\Delta)$ from this design for
different universes of generalization may be obtained from the information
in Table 3. The entries in the table indicate those components which
enter into the error variance. The components are to be divided by
the frequencies shown in the last column of the table.

Summary 

A  total  of  104  fourth  grade  students  in  eight  classes  participated 
in  this  study.   Samples  of  compositional  writing,  in  two  different 
writing  modes,  were  collected  on  three  occasions.   The  samples  were 
scored  by  four  trained  raters  using  an  8-point  general  impression  method. 

The  design  used,  a  split-plot  factorial,  considered  the  students 
as  nested  in  the  classes  and  crossed  with  the  raters,  modes,  and  occasions. 
A model was constructed which expressed the variance among all
observations as a linear combination of independent variance components.

Estimates of the variance components in the model were obtained
using the MIVQUE0 method in SAS. This procedure is applicable to
unbalanced designs such as the one considered in this study. Prior to
using the estimates of the variance components in the estimation of
generalizability coefficients, tests for homogeneity of variance were
performed.

Formulas  for  generalizability  coefficients  corresponding  to  seven 
universes of generalization were provided. In addition, components of
variance  entering  the  formulas  for  the  standard  error  of  measurement 
were  listed  for  seven  universes  of  generalization.   These  universes 
represented  generalization  across  one  dimension  (raters,  modes,  or 
occasions),  two  dimensions  (raters  and  modes,  etc.),  or  three  dimensions 
(raters,  modes,  and  occasions). 


CHAPTER  IV 
RESULTS 

This study was designed to apply the principles of generalizability
theory to the assessment of writing ability in young children.
Samples  of  writing  from  fourth  grade  children  were  collected  in  two 
modes  at  each  of  three  occasions  during  the  school  year.   A  general 
impression  method  of  scoring  was  used  by  four  trained  raters. 

Because  the  children  were  nested  in  the  classes,  the  observations 
were  first  considered  in  a  split-plot  factorial  design  with  unequal 
numbers  of  subjects  in  the  classes.   The  variance  components  for  all 
effects  in  this  model  were  estimated  using  the  MIVQUE  method. 

A  model  ignoring  the  class  dimension  was  also  considered.   For 
this  second  model,  estimates  of  the  variance  components  were  obtained 
through  the  analysis  of  variance  mean  squares.   The  results  from  these 
methods are reported in this chapter. Also reported here are the
results of the homogeneity of variance tests as well as certain
coefficients of generalizability and error variances, $\sigma^2(\Delta)$.

Estimates  of  the  Variance  Components 
The point estimates of the variance components in model (16),
obtained from the MIVQUE0 method of SAS, are reported in Table 4 along
with their corresponding degrees of freedom. Negative estimates were
replaced by zeros, following the recommendation of Cronbach et al.
(1972) among others. These zero estimates are no longer unbiased
(Searle, 1971b, p. 23) and are obviously bad estimates since a variance
is, by definition, non-negative.

Searle  (1971b)  suggested  six  courses  of  action  to  follow  when 
negative  estimates  of  variance  components  are  obtained.   Three  of 
these  alternatives  involve  assuming  that  the  true  value  is  zero.   The 
first  one  is  to  report  the  negative  estimate  but  use  it  as  evidence  that 
the  true  value  is  zero.   The  second  one  is  to  change  the  negative  estimate 
to  zero,  as  was  done  in  this  study.   The  third  involves  ignoring  the 
negative  components  from  the  model  and  reestimating  the  other  components. 
The  fourth  is  to  use  the  negative  estimate  as  an  indication  of  an 
inappropriate  model  for  the  data  and  to  reconsider  the  model,  possibly 
considering  models  with  finite  instead  of  infinite  populations.   The 
fifth  course  of  action  is  to  use  Bayesian  or  maximum  likelihood 
estimators. The last recommendation suggested by Searle is "the
statistician's last hope": to collect more data.
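
The second course of action, the one followed in this study, can be
sketched in a few lines. This is only an illustration: the function name
and the raw values below are hypothetical and are not taken from the
study's output.

```python
def truncate_negative(estimates):
    """Set negative variance component estimates to zero.

    Returns the adjusted estimates and the names of the components
    that were changed, so that they can be flagged in a table.
    """
    adjusted = {name: max(value, 0.0) for name, value in estimates.items()}
    flagged = [name for name, value in estimates.items() if value < 0.0]
    return adjusted, flagged


# Hypothetical raw estimates, for illustration only.
raw = {"classes": -0.004, "students": 0.300, "modes": -0.001}
adjusted, flagged = truncate_negative(raw)
print(adjusted)
print(flagged)
```

Keeping the list of flagged components makes it possible to mark the
truncated entries, as is done with asterisks in Table 4.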

As shown in Table 4, seven out of the 23 estimates are considered
to be zero. The actual estimates were very small. In general, all
estimates of the variance components were small. This may be partially
due to the restricted range imposed by the 1 to 8 rating scale.
The largest estimates were for the student effect
($\hat{\sigma}^2(\pi) = .346$), the student-by-mode-by-occasion
interaction ($\hat{\sigma}^2(\beta\gamma\pi) = .339$), and the
student-by-mode-by-occasion-by-rater interaction, which is confounded
with the error ($\hat{\sigma}^2(\beta\gamma\theta\pi) = .235$). Following
in order of magnitude were the student-by-occasion interaction
($\hat{\sigma}^2(\gamma\pi) = .073$) and the occasion main effect
($\hat{\sigma}^2(\gamma) = .070$). All other estimates appear negligible.

TABLE 4

POINT ESTIMATES OF THE VARIANCE COMPONENTS
FOR THE MODEL (16)

VARIANCE COMPONENT                           df      POINT ESTIMATE

$\hat{\sigma}^2(\alpha)$                      7      0.000*
$\hat{\sigma}^2(\pi)$                        96      0.346
$\hat{\sigma}^2(\beta)$                       1      0.000*
$\hat{\sigma}^2(\alpha\beta)$                 7      0.000*
$\hat{\sigma}^2(\beta\pi)$                   96      0.024
$\hat{\sigma}^2(\gamma)$                      2      0.070
$\hat{\sigma}^2(\alpha\gamma)$               14      0.000*
$\hat{\sigma}^2(\gamma\pi)$                 192      0.073
$\hat{\sigma}^2(\theta)$                      3      0.008
$\hat{\sigma}^2(\alpha\theta)$               21      0.000*
$\hat{\sigma}^2(\theta\pi)$                 288      0.002
$\hat{\sigma}^2(\beta\gamma)$                 2      0.010
$\hat{\sigma}^2(\alpha\beta\gamma)$          14      0.056
$\hat{\sigma}^2(\beta\gamma\pi)$            192      0.339
$\hat{\sigma}^2(\beta\theta)$                 3      0.000*
$\hat{\sigma}^2(\alpha\beta\theta)$          21      0.002
$\hat{\sigma}^2(\beta\theta\pi)$            288      0.021
$\hat{\sigma}^2(\gamma\theta)$                6      0.003
$\hat{\sigma}^2(\alpha\gamma\theta)$         42      0.000*
$\hat{\sigma}^2(\gamma\theta\pi)$           576      0.007
$\hat{\sigma}^2(\beta\gamma\theta)$           6      0.006
$\hat{\sigma}^2(\alpha\beta\gamma\theta)$    42      0.017
$\hat{\sigma}^2(\beta\gamma\theta\pi)$      576      0.235

*Negative estimate has been replaced by zero.
Note: $\alpha$ = classes, $\pi$ = students, $\beta$ = modes, $\gamma$ = occasions, $\theta$ = raters.

Test of Homoscedasticity Assumption
The generalizability coefficients obtained from the intraclass
correlation formulas are unbiased only if homogeneity of variance
assumptions are met. To test this assumption in the context of the
split-plot factorial design, a procedure described by Kirk (1968,
pp. 258-261) was used. The test for the equality of the eight
variance-covariance matrices (corresponding to the eight classes)
resulted in a chi-square value of 35.11. With 2100 degrees of freedom,
the observed chi-square was not significant at the .10 level.

Since  the  eight  matrices  were  not  significantly  different,  a 
pooled  variance-covariance  matrix  was  constructed.   Testing  for  the 
equality  of  the  diagonal  elements  in  the  pooled  matrix  resulted  in  a 
chi-square value of 50.78, which was not statistically significant at
the .10 level with 23 degrees of freedom. This result indicated that
differences  among  the  diagonal  elements  in  the  pooled  matrix  were  not 
statistically  significant.  The  results  from  these  two  tests  lent 
support  to  the  homogeneity  of  variance  assumption. 
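
The degrees of freedom quoted for the two tests can be checked with a
short calculation. The sketch below assumes the usual degrees of freedom
for a Box-type test of equal covariance matrices, (k - 1)p(p + 1)/2 for k
groups and p repeated measures, and p - 1 for the test of equal diagonal
elements; the function names are hypothetical.

```python
def df_equal_covariance(n_groups, n_measures):
    """Degrees of freedom for testing equality of covariance matrices
    across groups (assumed Box-type formula)."""
    return (n_groups - 1) * n_measures * (n_measures + 1) // 2


def df_equal_diagonal(n_measures):
    """Degrees of freedom for testing equality of the diagonal elements
    of a single covariance matrix (assumed to be p - 1)."""
    return n_measures - 1


# Each student has 2 modes x 3 occasions x 4 raters = 24 scores.
measures = 2 * 3 * 4
print(df_equal_covariance(8, measures))  # eight classes
print(df_equal_diagonal(measures))
```

With eight classes and 24 scores per student, these expressions give the
2100 and 23 degrees of freedom reported above.
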
Generalizability  Coefficients 
The coefficients reported in this section were obtained by
substituting the point estimates from Table 4 into the formulas derived
in Table 2. Forty-nine coefficients were estimated, corresponding to
seven different universes of generalization and seven different
combinations of condition frequency. These coefficients are reported in
Table 5. The first five represent combinations which yield a total of
24 observations on each person. Within that restriction, the
combinations are included to show which facet needs to be sampled most
frequently. The last two combinations are included to show the effect
on the coefficients of minimum sampling.

The smallest coefficient obtained, .330, corresponds to a situation
where  generalization  is  intended  across  raters,  modes,  and  occasions 
but  each  facet  is  sampled  only  once.   This  situation  may  occur  if  a 
classroom  teacher  were  to  base  the  student's  writing  scores  for  the 
year  on  one  sample  of  writing. 

For the same universe of generalization, increasing the number of
conditions for the mode and occasion facets by one results in an
increased coefficient of .624. The highest coefficient for that
universe,  .834,  is  obtained  when  six  conditions  for  the  occasion  facet 
are  sampled  and  the  rater  and  mode  facets  are  each  sampled  twice. 
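
The three coefficients quoted for the universe in which all facets are
random can be recomputed from the Table 4 estimates. The sketch below
assumes that the coefficient is the student component divided by itself
plus the relative error variance, and that only the student-by-facet
interaction components, each divided by its sampling frequencies, enter
that error. Under this assumption the results agree with the reported
values to within about .001. The function and variable names are
illustrative, not taken from the study.

```python
import math

# Table 4 point estimates. Keys name the facets interacting with
# students: m = modes, o = occasions, r = raters.
STUDENT = 0.346
STUDENT_BY_FACET = {"m": 0.024, "o": 0.073, "r": 0.002,
                    "mo": 0.339, "mr": 0.021, "or": 0.007, "mor": 0.235}


def relative_error_variance(n_modes, n_occasions, n_raters):
    """Sum each interaction component divided by the product of the
    numbers of conditions sampled for the facets it involves."""
    n = {"m": n_modes, "o": n_occasions, "r": n_raters}
    return sum(component / math.prod(n[f] for f in facets)
               for facets, component in STUDENT_BY_FACET.items())


def g_coefficient(n_modes, n_occasions, n_raters):
    """Generalizability coefficient with all three facets random."""
    error = relative_error_variance(n_modes, n_occasions, n_raters)
    return STUDENT / (STUDENT + error)


print(g_coefficient(1, 1, 1))  # one mode, one occasion, one rater
print(g_coefficient(2, 2, 1))  # two modes, two occasions, one rater
print(g_coefficient(2, 6, 2))  # two modes, six occasions, two raters
```

Small differences in the third decimal are to be expected because the
inputs here are the rounded estimates printed in Table 4.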

As the universe of generalization is restricted, by fixing one or
more facets, the generalizability coefficients tend to increase. In
all  universes,  the  smallest  coefficients  are  found  when  only  one 
condition  of  each  facet  is  sampled. 

In  the  three  universes  having  only  one  facet  fixed,  the  highest 
coefficients  correspond  to  the  two  situations  where  the  mode  by  occasion 
combinations  are  sampled  the  most.   In  the  last  universe,  where 
generalization  is  intended  across  raters  only,  all  the  coefficients  are 
high. 

The Error Variance $\sigma^2(\Delta)$

The variance components were also used in estimating the error
variance $\sigma^2(\Delta)$, the square root of which may be used for
obtaining confidence intervals around an individual's universe score.
Several components, $\hat{\sigma}^2(\Delta)$, were estimated corresponding
to the seven different universes of generalization. The results of this
estimation are presented in Table 6. For each universe of generalization,
seven estimates are included. These estimates correspond to different
sampling combinations. The first five combinations yield a total of 24
observations. The last two represent minimal sampling of conditions
within each facet.

As shown in the table for the first three universes, and again
for the fifth, the error variance is at a minimum in the column where
two raters, two modes, and six occasions are sampled. For the
fourth and sixth universes, the second combination is the one which
minimizes $\hat{\sigma}^2(\Delta)$. In the last universe, the first five
combinations yield small error variances.
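
For the universe in which all three facets are random, the error variance
for absolute decisions can be sketched as follows. This is a hedged
illustration, not the Table 3 formula itself: it assumes that the error
variance is the sum of the student-by-facet components and the
facet-only components from Table 4, each divided by its sampling
frequencies, and that components involving classes are left out. The
names used are hypothetical.

```python
import math

# Table 4 point estimates: m = modes, o = occasions, r = raters.
STUDENT_BY_FACET = {"m": 0.024, "o": 0.073, "r": 0.002,
                    "mo": 0.339, "mr": 0.021, "or": 0.007, "mor": 0.235}
FACET_ONLY = {"m": 0.000, "o": 0.070, "r": 0.008,
              "mo": 0.010, "mr": 0.000, "or": 0.003, "mor": 0.006}


def absolute_error_variance(n_modes, n_occasions, n_raters):
    """Error variance for absolute decisions, all facets random.

    Assumption: class-related components are not included.
    """
    n = {"m": n_modes, "o": n_occasions, "r": n_raters}
    total = 0.0
    for components in (STUDENT_BY_FACET, FACET_ONLY):
        for facets, value in components.items():
            total += value / math.prod(n[f] for f in facets)
    return total


def standard_error(n_modes, n_occasions, n_raters):
    """Square root of the absolute error variance."""
    return math.sqrt(absolute_error_variance(n_modes, n_occasions, n_raters))


print(absolute_error_variance(1, 1, 1))
print(absolute_error_variance(2, 6, 2))
```

Because every component is divided by the number of conditions sampled,
the error variance under this sketch can only shrink as more conditions
are added to any facet.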

Supplementary  Analysis 

Since five of the seven negative estimates obtained were associated
with the classes effect, $\alpha$, a follow-up analysis was done eliminating
the classes from the model. Dropping the classes resulted in a
four-way balanced factorial design without replications. This was one of
the  designs  considered  by  Medley  and  Mitzel   (1963).   For  this  design, 
the  point  estimates  of  the  variance  components  were  obtained  using  the 
mean  squares  from  the  analysis  of  variance  reported  in  Table  7.   These 
mean  squares  were  substituted  into  the  formulas  for  the  point  estimates 
given  by  Medley  and  Mitzel   (1963,  p. 312). 
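
The computation can be sketched from the mean squares in Table 7. The
sketch assumes that the Medley and Mitzel formulas correspond to the
usual expected-mean-square solution for a fully crossed random design
with one observation per cell: the mean squares of all sources that
contain the effect are combined with alternating signs, and the result
is divided by the number of levels of the facets not in the effect. The
names below are illustrative.

```python
from itertools import combinations

FACETS = "SMOR"  # students, modes, occasions, raters
LEVELS = {"S": 104, "M": 2, "O": 3, "R": 4}

# Mean squares from Table 7, keyed by the facets in each source.
MS = {
    "S": 11.090, "M": 0.673, "O": 75.169, "R": 4.319,
    "SM": 2.041, "SO": 2.428, "SR": 0.325,
    "MO": 10.536, "MR": 0.050, "OR": 1.368,
    "SMO": 1.882, "SMR": 0.327, "SOR": 0.261, "MOR": 0.970,
    "SMOR": 0.247,
}


def variance_component(effect):
    """Estimate one component by alternating sums of mean squares."""
    others = [f for f in FACETS if f not in effect]
    total = 0.0
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            source = "".join(f for f in FACETS if f in effect or f in extra)
            total += (-1) ** size * MS[source]
    divisor = 1
    for f in others:
        divisor *= LEVELS[f]
    return total / divisor


estimates = {effect: variance_component(effect) for effect in MS}
negative = sorted(effect for effect, value in estimates.items() if value < 0)
print(round(estimates["S"], 3), round(estimates["SMO"], 3))
print(negative)
```

Under these assumptions the student and student-by-mode-by-occasion
estimates come out at about 0.355 and 0.409, and the negative estimates
are those for modes, mode-by-rater, and student-by-rater.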

The resulting estimates of the variance components are reported
in Table 8. As shown in Table 8, three of the 15 point estimates were
negative and have been replaced by zeros. Of these, only the estimate
of the student-by-rater interaction had been positive in Table 4. The
ratio of negative estimates to the total number of estimates is smaller
for the model without the classes effects than for the initial model
including the classes effects. Therefore, using negative estimates as
the criterion, it appears that eliminating the effects involving classes,
$\alpha$, from model (16) results in a better model for these data.

TABLE 7

ANALYSIS OF VARIANCE FROM A FOUR-WAY FACTORIAL
DESIGN WITHOUT REPLICATIONS
STUDENT(S) X MODE(M) X OCCASION(O) X RATER(R)

Source               df           SS          MS

S                   103     1142.285      11.090
M                     1        0.673       0.673
O                     2      150.337      75.169
R                     3       12.956       4.319
S x M               103      210.202       2.041
S x O               206      500.079       2.428
S x R               309      100.335       0.325
M x O                 2       21.073      10.536
M x R                 3        0.149       0.050
O x R                 6        8.211       1.368
S x M x O           206      387.677       1.882
S x M x R           309      101.143       0.327
S x O x R           618      161.372       0.261
M x O x R             6        5.821       0.970
S x M x O x R       618      152.762       0.247

TABLE 8

POINT ESTIMATES OF THE VARIANCE COMPONENTS FOR
THE FOUR-WAY FACTORIAL WITHOUT REPLICATIONS

VARIANCE COMPONENT*                          POINT ESTIMATE

$\hat{\sigma}^2(\pi)$                        0.355
$\hat{\sigma}^2(\beta)$                      0.000**
$\hat{\sigma}^2(\beta\pi)$                   0.006
$\hat{\sigma}^2(\gamma)$                     0.076
$\hat{\sigma}^2(\gamma\pi)$                  0.066
$\hat{\sigma}^2(\theta)$                     0.006
$\hat{\sigma}^2(\theta\pi)$                  0.000**
$\hat{\sigma}^2(\beta\gamma)$                0.019
$\hat{\sigma}^2(\beta\gamma\pi)$             0.409
$\hat{\sigma}^2(\beta\theta)$                0.000**
$\hat{\sigma}^2(\beta\theta\pi)$             0.027
$\hat{\sigma}^2(\gamma\theta)$               0.002
$\hat{\sigma}^2(\gamma\theta\pi)$            0.007
$\hat{\sigma}^2(\beta\gamma\theta)$          0.007
$\hat{\sigma}^2(\beta\gamma\theta\pi)$       0.007

*The same notation used for model (16) will be used here.
**Negative estimate has been replaced by zero.

The estimates in Table 4, obtained by the MIVQUE0 method, and those
in Table 8, obtained through the analysis of variance mean squares,
are very close. The similarity between the estimates obtained from the
two different methods lends support to the validity of MIVQUE as
a useful method when the data are unbalanced. The analysis of variance
approach, as was mentioned earlier, is universally accepted as the best
method for balanced data.

Summary 
Point  estimates  for  all  variance  components  in  the  model  were 
obtained  and  reported  in  Table  4.   Negative  estimates  were  replaced  by 
zeros.   The  magnitude  of  the  estimates  indicated  that  students  could 
be  differentiated  on  the  basis  of  their  ratings.   However,  the  classes 
as  units  could  not  be  distinguished.   Of  the  three  sources  of  error 
examined,  the  occasion  facet  constituted  the  greatest  source.   The  mode 
facet  was  next  in  magnitude.   Raters  represented  an  insignificant 
source  of  errors. 

The  tests  of  homogeneity  of  variance  lent  support  to  the  assumption 
that  the  variances  within  each  condition  combination  were  equal. 
Assuming  homogeneity  of  variance,  unbiased  generalizability  coefficients 
were  obtained  for  seven  universes  of  generalization.   These  universes 
represented  generalization  across  one  facet,  two  facets,  or  all  three 
facets  simultaneously.   For  each  universe,  seven  coefficients  were 
computed  for  possible  D-studies  with  various  combinations  of  condition 
frequencies. For most universes, the coefficients indicated that to
obtain acceptable levels of generalizability at least six samples of
writing from each person are necessary. The only exception was when
generalization was intended across raters only. The results for the
error variance, corresponding to the standard error of measurement,
were similar to those based on the generalizability coefficients.

A  supplementary  analysis  which  compared  the  estimates  obtained 
through  the  MIVQUE  method  to  those  derived  using  expected  mean  squares, 
resulted  in  similar  values  for  all  estimates  in  a  model  without  the 
classes  effect.   These  results  were  interpreted  as  lending  support  to 
the  validity  of  the  MIVQUE  method. 


CHAPTER  V 

DISCUSSION 
In  this  study,  generalizability  theory  was  applied  to  the  assessment 
of  writing  ability  in  young  children.   A  universe  of  generalization  was 
defined  in  terms  of  three  facets:   modes,  occasions,  and  raters.   Samples 
of  children's  writing  performance  were  obtained  under  selected  conditions 
from  each  facet.   The  design  permitted  the  investigation  of  three  main 
sources  of  error  and  their  interactions.   These  sources  were  considered 
to  affect  the  inference  of  writing  ability  from  writing  performance. 
Formulas  for  generalizability  coefficients  were  derived  for  seven 
universes  of  generalization. 

The  first  two  sources  of  error  were  defined  in  terms  of  variability 
in  the  quality  of  the  writing  samples.   This  variability  may  result  from 
changes  in  the  subject's  performance  across  time  (occasions)  and  across 
assignment  (modes).   The  third  source  of  error  may  result  from  differences 
in  the  standard  of  judgement  used  by  different  raters  when  scoring  the 
samples.   Using  the  principles  of  generalizability  theory,  the  relative 
contributions  of  these  sources  of  error  were  examined  via  estimates  of 
the  variance  components.   The  discussion  of  the  results  is  focused  on 
the  interpretation  of  the  variance  components  and  the  usefulness  of 
the theory. The limitations of this and similar studies are also
considered.

Interpretation of Variance Components
The largest component of variance was that associated with the
students, indicating that it was possible to rank order the students
on the basis of their ratings. This component represented the
universe score variance. The classes component, on the other hand,
was considered to be zero (the actual estimate was negative), indicating
that the eight classes could not be differentiated as units on the
basis of the ratings received by the students. All but three components
of interactions involving the classes were also zero. The three
nonzero components were: the class-by-mode-by-occasion, .056; the
class-by-mode-by-rater, .002; and the interaction of the classes with all
three facets, .017.

Generalization Across One Facet

The  point  estimates  for  the  student-by-facet  interactions  for  the 
mode,  occasion,  and  rater  facets  were  .024,  .073,  and  .002,  respectively. 
These  interaction  components  reflect  the  relative  contribution  of  each 
source  of  error  when  generalization  is  done  along  that  one  dimension 
only.   No  interaction  would  mean  that  students  are  similarly  rank 
ordered  across  all  conditions  of  that  facet,  thus  generalization  across 
all  conditions  would  be  possible.   On  comparing  these  three  estimates, 
it  appears  that  occasions  represented  the  greatest  source  of  error  while 
raters  represented  the  smallest.   The  large  relative  contribution  of 
occasions  to  error  means  that  students  are  not  ranked  in  the  same  manner  for 
all  three  occasions.   Differential  learning  might  have  taken  place  during 
the school year. An implication is that when making an assessment of
writing ability, it is important to note when, during the year, the
measure was obtained. If generalization is intended across different
occasion conditions, then several conditions should be sampled.

The small component associated with the student-by-rater
interaction indicates that the four raters ranked the students similarly.
It seems possible, then, to train raters in applying the general
impression scoring method systematically. Since this scoring method is both
fast  and  efficient,  large  scale  projects  could  confidently  take 
advantage  of  it.   It  is  important  to  remember  that,  after  scoring 
several  papers,  the  raters  discussed  those  samples  which  received 
differing  scores.  Thus,  it  is  not  surprising  that  this  source  of 
error  was  minimal. 

The  student-by-mode  interaction  was  large  enough  to  indicate  that 
changes  in  the  task  may  result  in  different  rankings  of  students. 
Different  modes  of  writing  may  demand  different  abilities  from  the 
students.   A  piece  of  creative  writing,  for  example,  would  require  an 
exercise  of  the  imagination  while  writing  a  report  would  require  the 
ability  to  organize  facts  in  a  meaningful  fashion. 

The  three  main  effect  components  associated  with  modes,  occasions, 
and  raters  were  .000,  .070,  and  .008,  respectively.   These  components 
reflect  systematic  changes  and  contribute  to  error  only  if  absolute 
decisions  are  being  made  or  when  different  conditions  are  sampled  for 
different  students.   Again,  the  occasion  component  is  the  largest, 
indicating  that  the  overall  ratings  were  greater  on  some  occasions  than 
in  others.   It  is  possible  that  all  students  improved  their  writing 
performance during the school year. The rater component is small but
higher than the student-by-rater interaction. This component
reflects any systematic rater bias. There appeared to be no systematic
variability due to modes.

Generalization  Across  Two  or  Three  Facets 

When  generalization  is  intended  across  more  than  one  dimension,  in 
addition  to  the  components  discussed  in  the  previous  section,  those 
components  involving  the  interactions  among  facets  must  be  considered. 
The  three-way  interaction  components  involving  the  students  and  two 
facets  were  .339,  .021,  and  .007  for  the  mode-occasion,  mode-rater,  and 
occasion-rater  combinations,  respectively.   The  first  one  is  relatively 
large,  almost  equal  in  magnitude  to  the  student  component.   The 
interpretation of that component is that differences in students'
ranking across the mode conditions change as a function of the occasion
conditions. A large component indicates that, when generalization is
intended across these two facets, the conditions should be sampled
frequently, if error is to be minimized. This fact is reflected in
Table 5 where coefficients of generalizability are shown for several
condition  combinations.   The  largest  coefficients  correspond  to 
situations  where  modes  and  occasions  are  sampled  most  frequently. 

The  student-by-mode-by-rater  component  reflects  some  variability 
due  to  differential  ranking  of  students  by  the  raters  as  a  function  of 
the  mode.   That  is,  raters  were  not  as  consistent  in  one  mode  as  they 
were  in  the  other.   The  small  student-by-occasion-by-rater  interaction 
indicates  that  raters  were  almost  as  consistent  in  one  occasion  as  they 
were in the others.

The two-way interaction components among facets were .010, .000,
and .003 for the mode-by-occasion, mode-by-rater, and occasion-by-rater
components, respectively. These components enter into the error
variance $\sigma^2(\Delta)$ but not $\sigma^2(\delta)$. Of these, only the
mode-by-occasion component is large enough to warrant consideration.
This component indicates that differences in the overall ratings across
modes vary as a function of the occasion. For example, it is possible
that all students performed better when writing the creative story at
the beginning of the year. On the other hand, at the end of the year
they might have done a better job on the factual reports. If all
students had more practice in one mode during the year, their improved
ability in that mode would be reflected in this component.

When  generalizing  across  all  three  facets,  two  additional  components 
of  variance  must  be  considered.   The  four-way  interaction  component 
involving  students  and  all  three  facets  was  relatively  large,  .235. 
Since  there  were  no  replications  within  any  three  facet  combination, 
this  component  was  confounded  with  the  error  of  replication.   The 
magnitude  of  this  component  indicates  that  generalization  across  all 
three  facets  requires  that  more  than  one  condition  of  at  least  one  facet 
be  sampled  in  order  to  minimize  the  error.   The  three-way  interaction 
component  among  the  three  facets  was  relatively  small,  .006. 

Based  on  the  previous  discussion,  it  may  be  concluded  that  the 
occasion  facet  represented  a  greater  source  of  error  than  the  mode 
facet.   The  mode  facet,  in  turn,  represented  a  greater  source  of  error 
than  the  rater  facet.   With  proper  training  and  practice,  the  rater 
facet may be almost irrelevant. These findings agree with those of
Finlayson (1951) and Vernon and Millican (1954) who concluded that
differences in essays contributed more to unreliability than
differences in raters. The differences in essays were further investigated
in this study, since essays were characterized along two dimensions.
Both of those dimensions were found to be important in this study.
Furthermore, one of them was found to be more important than the other.

These  findings  also  support  the  recommendations  made  by  experts 
in  the  field  of  language  arts  and  discussed  in  Chapter  II.   To  obtain 
a  reliable  assessment  of  writing  ability  more  than  one  sample  of 
writing  should  be  collected  on  more  than  one  occasion  and  on  more  than 
one  mode.   How  many  is  more  than  one?  That  depends  on  the  intended 
universe  of  generalization. 

An examination of Tables 5 and 6 provides some guidelines for
answering that question. In those tables seven universes of
generalization are considered. The first universe represents
generalization across all three facets. The next three reflect
generalization across two facets only, with the third facet held
constant. The last three universes correspond to generalization across
one facet only: occasions, modes, and raters, in that order. Several
condition combinations are included in each table.

The entries in Table 5 represent generalizability coefficients
obtained via intraclass correlation formulas. The error variance
entering into those coefficients is $\sigma^2(\delta)$. In general, the
highest coefficients, across all seven universes, correspond to
situations where 12 writing samples are collected (the second and fourth
condition combinations). Collecting six writing samples (first, third,
and fifth condition combinations) results in a decrease in the
coefficients. However, the decrease is not too drastic, except perhaps
in situations where all six samples are collected on one occasion. This
situation seems unrealistic since, in this case, writer fatigue would
interfere with writing ability. If only four samples are collected and
only one rater is used, the coefficients drop below .7 for most
universes. With only one sample, as shown in the last condition
combination, most coefficients would be unacceptable.

The entries in Table 6 represent the estimates of the error variance
$\sigma^2(\Delta)$, which takes into account systematic effects. The square
root of the entries, $\sigma(\Delta)$, represents the standard error of
measurement. Thus, the information in Table 6 may be used in constructing
confidence intervals around individuals' true scores. In general, the
conclusions that may be made based on the results shown in this table
are similar to those based on Table 5. That is, for these estimates,
those condition combinations which maximize $\rho^2(x,\mu)$ also
minimize $\sigma^2(\Delta)$.
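
The use of the standard error of measurement in building a confidence
interval can be sketched as follows. The sketch assumes normally
distributed errors and a 90 percent interval (z = 1.645); the observed
mean and error variance shown are hypothetical values chosen for
illustration, not entries from the tables.

```python
import math


def confidence_interval(observed_mean, error_variance, z=1.645):
    """Interval around an observed mean score.

    The half-width is z times the standard error of measurement,
    which is the square root of the error variance.
    """
    sem = math.sqrt(error_variance)
    return observed_mean - z * sem, observed_mean + z * sem


# Hypothetical mean rating and error variance, for illustration only.
low, high = confidence_interval(5.0, 0.25)
print(low, high)
```

A smaller error variance, obtained by sampling more conditions, gives a
proportionally narrower interval on the rating scale.
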
Usefulness  of  Generalizability  Theory 
On  the  basis  of  this  study  it  may  be  said  that  generalizability 
theory  provides  a  useful  method  for  estimating  the  reliability  of 
measures  of  writing  ability.   With  a  clear  definition  of  error  and  using 
repeated  studies,  it  might  have  been  possible  to  examine  certain 
reliabilities of essays using classical methods. Those reliabilities
which  include  components  of  interactions  among  facets  would,  of  course, 
be  impossible  to  obtain  under  classical  methods.   For  those  reliabilities 
which  are  estimable  under  classical  methods,  the  treatment  would  be 
more awkward. The basic requirement under the framework of
generalizability theory is that the source of error be identified as a
facet and that conditions of that facet be sampled and incorporated into
the design. In that manner, the components of variance associated with
that source are estimable. Including facets in a design is a
popular method of control in educational research since, typically,
this kind of research takes place in the natural setting. It follows
that generalizability theory provides a practical methodology in
those situations.

Given  the  applicability  of  the  theory  to  problems  of  reliability, 
it  is  surprising  that  applications  of  it  are  scarce  in  the  literature. 
Some  possible  explanations  of  this  situation  are  considered  here. 
These  are:   (a)  the  unfamiliarity  of  applied  educational  researchers 
with  the  methods,  (b)  the  unavailability  of  formulas  for  more  complex 
designs,  or  (c)  the  limitation  imposed  by  the  restriction  of  balance. 

This  application  of  the  theory  is  a  step  in  making  the  methods 
more  familiar  to  a  wider  group  of  applied  educational  researchers.   In 
particular,  researchers  in  the  field  of  compositional  writing  have  been 
provided  with  estimates  of  variance  components  which  may  be  useful  in 
the  planning  of  both  comparative  and  absolute  D-studies  in  that  area. 
In  addition,  formulas  for  the  generalizability  coefficients  have  been 
derived  for  the  design  used  in  this  study.   Those  formulas  may  be  adapted 
to  fit  other  designs  which  represent  subsets  of  our  universe  of  admissible 
observations.   All  that  would  be  required  is  that  those  terms  involving 
facets  not  included  in  the  design  be  dropped  from  the  formula. 

As was demonstrated in this study, the restriction of balance is
not necessary. Several methods are available for the estimation of
variance components in unbalanced designs. One of those methods was
used in this study. Computer programs in SAS may be used to obtain the
point estimates. The procedure available in the 1976 version of SAS
uses Henderson's method 3. A future version of SAS will include, in
addition to the current method, the MIVQUE0 method which was used in
this study. The point estimates obtained from the MIVQUE0 method were
very similar to those obtained for a reduced model via expected mean
squares. These results were presented in the supplementary analysis of
the previous chapter. Future research should focus on comparing the
"goodness" of these different methods when applied to specific situations.

These  computer  programs  have  certain  limitations  when  large  design 
matrices  are  involved.  For  large  design  matrices,  such  as  the  one  used 
in  this  study,  the  current  SAS  program  requires  an  excessive  amount  of 
computer  space  and  time.  For  example,  approximately  five  hours  would  have 
been  required  to  get  the  point  estimates  for  the  components  in  this  study 
under the current version. The MIVQUE0 method uses less time and memory
but  for  large  design  matrices  it  still  represents  an  expensive  process. 

However, the estimates of the variance components from one G-study
may be used in subsequent D-studies involving a similar population
of individuals and similar facets. The estimates computed for this
study may be useful to persons working with fourth grade students of
similar characteristics. A limitation is introduced by the high rate
of attrition in this sample. To the extent that the final sample is
representative of the fourth grade population, our estimates are useful.

An additional limitation of this study is introduced by the small
number of conditions sampled within each facet. As has been pointed out
by Henderson, among others, the sampling error of the estimates of
variance components is large when few conditions are used in the estimate.
In a different application of this design, then, it is possible that
the estimates obtained would vary from the ones in this study. As
the number of degrees of freedom increases, the accuracy of the estimate
also increases. It should be noted that the components used in the
generalizability coefficients have large numbers of degrees of freedom
since they involve the student effect.
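
The dependence of sampling error on degrees of freedom can be illustrated with a short Python sketch. It assumes balanced data, normally distributed effects, and independent mean squares, under which the variance of a mean square with f degrees of freedom is estimated without bias by 2 MS^2 / (f + 2). The function name and all numerical values are hypothetical and serve only as an illustration.

```python
from math import sqrt


def component_estimate_and_se(terms, divisor):
    """Estimate a variance component and its approximate standard error.

    terms   : list of (sign, mean_square, df) tuples, one per mean square in
              the linear combination (sign is +1 or -1).
    divisor : the product of sample sizes dividing the combination.
    """
    estimate = sum(sign * ms for sign, ms, _ in terms) / divisor
    # Under normality with independent mean squares, 2*MS^2/(df+2) is an
    # unbiased estimate of Var(MS); the signs vanish when squared.
    variance = sum(2.0 * ms ** 2 / (df + 2) for _, ms, df in terms) / divisor ** 2
    return estimate, sqrt(variance)


# Hypothetical students-by-raters example: sigma^2(rater) = [MS(r) - MS(pr)] / n_p.
n_students = 100

# Three raters: df(r) = 2, df(pr) = 99 * 2 = 198.
est_few, se_few = component_estimate_and_se(
    [(+1, 12.0, 2), (-1, 2.0, 198)], n_students
)

# Thirty raters with the same mean squares: df(r) = 29, df(pr) = 99 * 29 = 2871.
est_many, se_many = component_estimate_and_se(
    [(+1, 12.0, 29), (-1, 2.0, 2871)], n_students
)
```

The point estimate is the same in both cases, but the standard error is considerably larger when only three conditions of the facet are sampled.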

Summary  and  Conclusions 
This  study  examined  the  problem  of  reliability  of  measures  of 
writing  ability  in  the  context  of  generalizability  theory.   Three  main 
sources  of  error  variance  were  considered:   raters,  modes,  and  occasions. 
It  may  be  concluded  that  errors  resulting  from  variability  in  the 
quality  of  writing  across  occasions  and  modes  outweigh  those  stemming 
from  differences  among  raters.   With  training  and  practice,  raters  can 
consistently  score  the  writing  samples  of  students  using  a  general 
impression  method.   This  method  proved  to  be  both  fast  and  easy  to  use. 
To  improve  the  reliability  of  measures  of  written  composition  and 
decrease  the  standard  error  of  measurement,  the  emphasis  should  be 
placed on collecting several samples of writing. On the basis of the
estimates obtained in this study, collecting fewer than six samples would
result in coefficients below .70. Assessing the reliability of measures
of writing ability solely in terms of rater agreement only skims the
surface of the problem. It is unfortunate that this issue is most commonly
addressed in terms of inter-rater reliability.
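
The reasoning behind a minimum number of writing samples can be sketched as a simple D-study calculation in Python. The sketch assumes a single pooled error variance per writing sample that is divided by the number of samples averaged. The function names and the two variance values are hypothetical and are not the estimates obtained in this study.

```python
def coefficient(universe_var, error_var_per_sample, n_samples):
    """Generalizability coefficient when scores are averaged over n_samples."""
    return universe_var / (universe_var + error_var_per_sample / n_samples)


def min_samples(universe_var, error_var_per_sample, target, max_samples=50):
    """Smallest number of samples whose coefficient reaches the target, if any."""
    for n in range(1, max_samples + 1):
        if coefficient(universe_var, error_var_per_sample, n) >= target:
            return n
    return None  # target not reachable within max_samples


# Hypothetical variances, for illustration only.
needed = min_samples(universe_var=0.5, error_var_per_sample=1.0, target=0.70)
```

Because the error term shrinks in proportion to the number of samples, the coefficient rises quickly at first and then levels off, so each additional sample contributes less than the one before.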

This  study  demonstrated  the  potential  of  generalizability  theory 
for  clarifying  problems  of  reliability.   In  applying  the  theory,  the 
careful  identification  of  potential  sources  of  error  is  required.   Also, 
consideration must be given to the type of inference which is to be made
from the observations. On the basis of these considerations, the
universe  of  observations  is  defined.   A  carefully  designed  study  will 
allow  the  estimation  of  all  sources  of  error  variance  identified.   As 
was  shown  in  this  study,  it  is  not  necessary  to  limit  applications  of 
the  theory  to  balanced  designs.   Methods  of  variance  component  estima- 
tion for  unbalanced  designs  are  documented  in  the  statistical  literature 
and  available  in  SAS,  a  popular  package  of  statistical  programs. 
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APPENDIX  A 

POINT  ESTIMATES  OF  THE  VARIANCE  COMPONENTS  AS 
LINEAR  COMBINATIONS  OF  MEAN  SQUARES  FOR 
THE  SPLIT-PLOT  FACTORIAL  DESIGN  WITH  BALANCED  DATA 

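\[
\begin{aligned}
\hat{\sigma}^2(\alpha) = \frac{1}{n_{s(c)}\, n_m\, n_o\, n_r} \bigl[ & \mathrm{MS}(\alpha) - \mathrm{MS}(\pi) - \mathrm{MS}(\alpha\beta) - \mathrm{MS}(\alpha\gamma) - \mathrm{MS}(\alpha\theta) + \mathrm{MS}(\pi\beta) \\
& + \mathrm{MS}(\pi\gamma) + \mathrm{MS}(\pi\theta) + \mathrm{MS}(\alpha\beta\gamma) + \mathrm{MS}(\alpha\beta\theta) + \mathrm{MS}(\alpha\gamma\theta) - \mathrm{MS}(\pi\beta\gamma) - \mathrm{MS}(\pi\beta\theta) \\
& - \mathrm{MS}(\pi\gamma\theta) - \mathrm{MS}(\alpha\beta\gamma\theta) + \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
\hat{\sigma}^2(\pi) = \frac{1}{n_m\, n_o\, n_r} \bigl[ & \mathrm{MS}(\pi) - \mathrm{MS}(\pi\beta) - \mathrm{MS}(\pi\gamma) - \mathrm{MS}(\pi\theta) + \mathrm{MS}(\pi\beta\gamma) + \mathrm{MS}(\pi\beta\theta) \\
& + \mathrm{MS}(\pi\gamma\theta) - \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
\hat{\sigma}^2(\beta) = \frac{1}{n_{s(c)}\, n_c\, n_o\, n_r} \bigl[ & \mathrm{MS}(\beta) - \mathrm{MS}(\beta\gamma) - \mathrm{MS}(\beta\theta) + \mathrm{MS}(\beta\gamma\theta) - \mathrm{MS}(\alpha\beta) + \mathrm{MS}(\alpha\beta\gamma) \\
& + \mathrm{MS}(\alpha\beta\theta) - \mathrm{MS}(\alpha\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
\hat{\sigma}^2(\alpha\beta) = \frac{1}{n_{s(c)}\, n_o\, n_r} \bigl[ & \mathrm{MS}(\alpha\beta) - \mathrm{MS}(\alpha\beta\gamma) - \mathrm{MS}(\alpha\beta\theta) + \mathrm{MS}(\alpha\beta\gamma\theta) - \mathrm{MS}(\pi\beta) + \mathrm{MS}(\pi\beta\gamma) \\
& + \mathrm{MS}(\pi\beta\theta) - \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\hat{\sigma}^2(\pi\beta) = \frac{1}{n_o\, n_r} \bigl[ \mathrm{MS}(\pi\beta) - \mathrm{MS}(\pi\beta\gamma) - \mathrm{MS}(\pi\beta\theta) + \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\]

\[
\begin{aligned}
\hat{\sigma}^2(\gamma) = \frac{1}{n_{s(c)}\, n_c\, n_m\, n_r} \bigl[ & \mathrm{MS}(\gamma) - \mathrm{MS}(\beta\gamma) - \mathrm{MS}(\gamma\theta) + \mathrm{MS}(\beta\gamma\theta) - \mathrm{MS}(\alpha\gamma) + \mathrm{MS}(\alpha\gamma\theta) \\
& + \mathrm{MS}(\alpha\beta\gamma) - \mathrm{MS}(\alpha\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
\hat{\sigma}^2(\alpha\gamma) = \frac{1}{n_{s(c)}\, n_m\, n_r} \bigl[ & \mathrm{MS}(\alpha\gamma) - \mathrm{MS}(\alpha\beta\gamma) - \mathrm{MS}(\alpha\gamma\theta) + \mathrm{MS}(\alpha\beta\gamma\theta) - \mathrm{MS}(\pi\gamma) + \mathrm{MS}(\pi\beta\gamma) \\
& + \mathrm{MS}(\pi\gamma\theta) - \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\hat{\sigma}^2(\pi\gamma) = \frac{1}{n_m\, n_r} \bigl[ \mathrm{MS}(\pi\gamma) - \mathrm{MS}(\pi\beta\gamma) - \mathrm{MS}(\pi\gamma\theta) + \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\]

\[
\begin{aligned}
\hat{\sigma}^2(\theta) = \frac{1}{n_{s(c)}\, n_c\, n_m\, n_o} \bigl[ & \mathrm{MS}(\theta) - \mathrm{MS}(\beta\theta) - \mathrm{MS}(\gamma\theta) + \mathrm{MS}(\beta\gamma\theta) - \mathrm{MS}(\alpha\theta) + \mathrm{MS}(\alpha\beta\theta) \\
& + \mathrm{MS}(\alpha\gamma\theta) - \mathrm{MS}(\alpha\beta\gamma\theta) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
\hat{\sigma}^2(\alpha\theta) = \frac{1}{n_{s(c)}\, n_m\, n_o} \bigl[ & \mathrm{MS}(\alpha\theta) - \mathrm{MS}(\alpha\beta\theta) - \mathrm{MS}(\alpha\gamma\theta) + \mathrm{MS}(\alpha\beta\gamma\theta) - \mathrm{MS}(\pi\theta) + \mathrm{MS}(\pi\beta\theta) \\
& + \mathrm{MS}(\pi\gamma\theta) - \mathrm{MS}(\pi\beta\gamma\theta) \bigr].
\end{aligned}
\]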